Computation and Language 186
☆ How Numerical Precision Affects Mathematical Reasoning Capabilities of LLMs
Guhao Feng, Kai Yang, Yuntian Gu, Xinyue Ai, Shengjie Luo, Jiacheng Sun, Di He, Zhenguo Li, Liwei Wang
Despite the remarkable success of Transformer-based Large Language Models
(LLMs) across various domains, understanding and enhancing their mathematical
capabilities remains a significant challenge. In this paper, we conduct a
rigorous theoretical analysis of LLMs' mathematical abilities, with a specific
focus on their arithmetic performances. We identify numerical precision as a
key factor that influences their effectiveness in mathematical tasks. Our
results show that Transformers operating with low numerical precision fail to
address arithmetic tasks, such as iterated addition and integer multiplication,
unless the model size grows super-polynomially with respect to the input
length. In contrast, Transformers with standard numerical precision can
efficiently handle these tasks with significantly smaller model sizes. We
further support our theoretical findings through empirical experiments that
explore the impact of varying numerical precision on arithmetic tasks,
providing valuable insights for improving the mathematical reasoning
capabilities of LLMs.
☆ Can MLLMs Understand the Deep Implication Behind Chinese Images?
Chenhao Zhang, Xi Feng, Yuelin Bai, Xinrun Du, Jinchang Hou, Kaixin Deng, Guangzeng Han, Qinrui Li, Bingli Wang, Jiaheng Liu, Xingwei Qu, Yifei Zhang, Qixuan Zhao, Yiming Liang, Ziqiang Liu, Feiteng Fang, Min Yang, Wenhao Huang, Chenghua Lin, Ge Zhang, Shiwen Ni
As the capabilities of Multimodal Large Language Models (MLLMs) continue to
improve, the need for higher-order capability evaluation of MLLMs is
increasing. However, there is a lack of work evaluating MLLM for higher-order
perception and understanding of Chinese visual content. To fill the gap, we
introduce the **C**hinese **I**mage **I**mplication understanding
**Bench**mark, **CII-Bench**, which aims to assess the higher-order perception
and understanding capabilities of MLLMs for Chinese images. CII-Bench stands
out in several ways compared to existing benchmarks. Firstly, to ensure the
authenticity of the Chinese context, images in CII-Bench are sourced from the
Chinese Internet and manually reviewed, with corresponding answers also
manually crafted. Additionally, CII-Bench incorporates images that represent
Chinese traditional culture, such as famous Chinese traditional paintings,
which can deeply reflect the model's understanding of Chinese traditional
culture. Through extensive experiments on CII-Bench across multiple MLLMs, we
have made significant findings. First, a substantial gap is observed between
the performance of MLLMs and humans on CII-Bench: the highest MLLM accuracy is
64.4%, whereas human accuracy averages 78.2% and peaks at an impressive 81.0%.
Second, MLLMs perform worse on images of Chinese traditional culture,
suggesting limitations in their ability to understand high-level semantics and
a lack of deep knowledge about Chinese traditional culture. Finally, most
models exhibit improved accuracy when image emotion hints are incorporated
into the prompts. We believe that
CII-Bench will enable MLLMs to gain a better understanding of Chinese semantics
and Chinese-specific images, advancing the journey towards expert artificial
general intelligence (AGI). Our project is publicly available at
https://cii-bench.github.io/.
comment: 32 pages, 18 figures. Project Page: https://cii-bench.github.io/ Code:
https://github.com/MING_X/CII-Bench Dataset:
https://huggingface.co/datasets/m-a-p/CII-Bench
☆ Retrospective Learning from Interactions
Multi-turn interactions between large language models (LLMs) and users
naturally include implicit feedback signals. If an LLM responds in an
unexpected way to an instruction, the user is likely to signal it by rephrasing
the request, expressing frustration, or pivoting to an alternative task. Such
signals are task-independent and occupy a relatively constrained subspace of
language, allowing the LLM to identify them even if it fails on the actual
task. This creates an avenue for continually learning from interactions without
additional annotations. We introduce ReSpect, a method to learn from such
signals in past interactions via retrospection. We deploy ReSpect in a new
multimodal interaction scenario, where humans instruct an LLM to solve an
abstract reasoning task with a combinatorial solution space. Through thousands
of interactions with humans, we show how ReSpect gradually improves task
completion rate from 31% to 82%, all without any external annotation.
★ Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, Ping Luo
In this paper, we introduce Janus, an autoregressive framework that unifies
multimodal understanding and generation. Prior research often relies on a
single visual encoder for both tasks, such as Chameleon. However, due to the
differing levels of information granularity required by multimodal
understanding and generation, this approach can lead to suboptimal performance,
particularly in multimodal understanding. To address this issue, we decouple
visual encoding into separate pathways, while still leveraging a single,
unified transformer architecture for processing. The decoupling not only
alleviates the conflict between the visual encoder's roles in understanding and
generation, but also enhances the framework's flexibility. For instance, both
the multimodal understanding and generation components can independently select
their most suitable encoding methods. Experiments show that Janus surpasses
previous unified models and matches or exceeds the performance of task-specific
models. The simplicity, high flexibility, and effectiveness of Janus make it a
strong candidate for next-generation unified multimodal models.
comment: Technical Report
☆ SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction
Recent advancements in large language models (LLMs) have extended their
capabilities to handle long contexts. However, increasing the number of model
layers and the length of input sequences significantly escalates the memory
required to store key-value (KV) cache, posing challenges for efficient
inference. To mitigate this issue, we present SimLayerKV, a simple yet
effective method that reduces inter-layer KV cache redundancies by selectively
dropping cache in identified lazy layers. Our approach is based on the
observation that certain layers in long-context LLMs exhibit "lazy" behavior,
contributing less to modeling long-range dependencies compared to non-lazy
layers. By analyzing attention weight patterns, we find that the behavior of
these lazy layers is consistent across tokens during generation for a given
input. This insight motivates our SimLayerKV, which identifies lazy layers and
reduces their KV cache accordingly. SimLayerKV is training-free, generalizable,
and can be implemented with only seven lines of code. We conduct extensive
experiments on three representative LLMs (LLaMA2-7B, LLaMA3-8B, and
Mistral-7B) across 16 tasks from the LongBench benchmark. The results
demonstrate that SimLayerKV achieves a KV cache compression ratio of 5×
with only a 1.2% performance drop when combined with 4-bit quantization. Our
code is available at https://github.com/sail-sg/SimLayerKV.
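As a rough illustration of the lazy-layer idea described above, here is a minimal sketch, assuming a hypothetical criterion in which a layer counts as lazy when the newest query token's attention mass concentrates on the initial and most recent key positions; the window sizes and threshold are illustrative, not SimLayerKV's exact rule.

```python
import numpy as np

def find_lazy_layers(attn_per_layer, n_initial=4, n_recent=64, threshold=0.9):
    """attn_per_layer: list of [n_heads, q_len, k_len] attention matrices.
    Flags a layer as 'lazy' if the newest query token's attention mass on
    the first n_initial and last n_recent keys exceeds `threshold`
    (averaged over heads). Illustrative criterion only."""
    lazy = []
    for idx, attn in enumerate(attn_per_layer):
        last_query = attn[:, -1, :]                # [n_heads, k_len]
        mass = (last_query[:, :n_initial].sum(-1)
                + last_query[:, -n_recent:].sum(-1))  # may double-count on short inputs
        if mass.mean() > threshold:
            lazy.append(idx)
    return lazy

# For layers flagged lazy, the KV cache could then be trimmed to the same
# initial + recent window rather than kept in full.
```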
☆ A Unified View of Delta Parameter Editing in Post-Trained Large-Scale Models
Post-training has emerged as a crucial paradigm for adapting large-scale
pre-trained models to various tasks, whose effects are fully reflected by delta
parameters (i.e., the disparity between post-trained and pre-trained
parameters). While numerous studies have explored delta parameter properties
via operations like pruning, quantization, low-rank approximation, and
extrapolation, a unified framework for systematically examining these
characteristics has been lacking. In this paper, we propose a novel perspective
based on Riemann sum approximation of the loss function to elucidate delta
parameter editing operations. Our analysis categorizes existing methods into
three classes based on their post-editing performance: competitive, decreased,
and improved, explaining how they are expressed by the Riemann sum
approximation term and how they alter the model performance. Extensive
experiments on both visual and language models, including ViT, LLaMA 3, Qwen 2,
and Mistral, corroborate our theoretical findings. Furthermore, we introduce
extensions to existing techniques like DARE and BitDelta, highlighting their
limitations in leveraging the properties of delta parameters, and reorganize
them into general expressions to enhance the applicability and effectiveness of
delta parameter editing in post-trained models.
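For concreteness, one of the operations this framework covers, DARE-style random drop-and-rescale of delta parameters, can be sketched as follows; the standalone function and its defaults are illustrative, not the paper's implementation.

```python
import torch

def dare_edit(pretrained: torch.Tensor, posttrained: torch.Tensor, p: float = 0.9):
    """Drop each delta entry with probability p and rescale survivors by
    1/(1-p), so the edited delta is unbiased in expectation."""
    delta = posttrained - pretrained
    keep = torch.bernoulli(torch.full_like(delta, 1.0 - p))
    return pretrained + keep * delta / (1.0 - p)
```

In the paper's perspective, each such edit is then analyzed through the Riemann sum approximation of the loss.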
☆ A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement
Reinforcement Learning from Human Feedback (RLHF) has become the predominant
approach for language model (LM) alignment. At its core, RLHF uses a
margin-based loss for preference optimization, specifying ideal LM behavior
only by the difference between preferred and dispreferred responses. In this
paper, we identify a common pitfall of margin-based methods -- the
under-specification of ideal LM behavior on preferred and dispreferred
responses individually, which leads to two unintended consequences as the
margin increases: (1) The probability of dispreferred (e.g., unsafe) responses
may increase, resulting in potential safety alignment failures. (2) The
probability of preferred responses may decrease, even when those responses are
ideal. We demystify the reasons behind these problematic behaviors:
margin-based losses couple the change in the preferred probability to the
gradient of the dispreferred one, and vice versa, often preventing the
preferred probability from increasing while the dispreferred one decreases, and
thus causing a synchronized increase or decrease in both probabilities. We term
this effect, inherent in margin-based objectives, gradient entanglement.
Formally, we derive conditions for general margin-based alignment objectives
under which gradient entanglement becomes concerning: the inner product of the
gradients of preferred and dispreferred log-probabilities is large relative to
the individual gradient norms. We theoretically investigate why such inner
products can be large when aligning language models and empirically validate
our findings. Empirical implications of our framework extend to explaining
important differences in the training dynamics of various preference
optimization algorithms and to suggesting potential algorithm designs that
mitigate the under-specification issue of margin-based methods, thereby
improving language model alignment.
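A toy sketch of the quantity at issue, using a linear scorer as a stand-in for an LM's log-probabilities (the names and setup are ours): the entanglement condition compares the inner product of the two gradients against their individual norms.

```python
import torch

torch.manual_seed(0)
w = torch.randn(16, requires_grad=True)          # stand-in model parameters
x_pref, x_disp = torch.randn(16), torch.randn(16)

logp_pref = torch.dot(w, x_pref)                 # stand-in for log p(preferred)
logp_disp = torch.dot(w, x_disp)                 # stand-in for log p(dispreferred)

g_pref = torch.autograd.grad(logp_pref, w)[0]
g_disp = torch.autograd.grad(logp_disp, w)[0]

inner = torch.dot(g_pref, g_disp)
cosine = inner / (g_pref.norm() * g_disp.norm())
# Entanglement is concerning when `inner` is large relative to the norms
# (cosine close to 1): pushing log p(dispreferred) down then drags
# log p(preferred) down with it.
print(f"inner={inner.item():.3f}, cosine={cosine.item():.3f}")
```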
☆ AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents
Ke Yang, Yao Liu, Sapana Chaudhary, Rasool Fakoor, Pratik Chaudhari, George Karypis, Huzefa Rangwala
Agents powered by large language models (LLMs) that autonomously handle
personalized, standardized tasks can boost human efficiency. Automating web
tasks (like booking
hotels within a budget) is increasingly sought after. Fulfilling practical
needs, the web agent also serves as an important proof-of-concept example for
various agent grounding scenarios, with its success promising advancements in
many future applications. Prior research often handcrafts web agent strategies
(e.g., prompting templates, multi-agent systems, search methods, etc.) and the
corresponding in-context examples, which may not generalize well across all
real-world scenarios. On the other hand, there has been limited study on the
misalignment between a web agent's observation/action representation and the
pre-training data of the LLM it is based on. This discrepancy is especially
notable when LLMs are primarily trained for language completion rather than
tasks involving embodied navigation actions and symbolic web elements. Our
study enhances an LLM-based web agent by simply refining its observation and
action space to better align with the LLM's capabilities. This approach enables
our base agent to significantly outperform previous methods on a wide variety
of web tasks. Specifically, on WebArena, a benchmark featuring general-purpose
web interaction tasks, our agent AgentOccam surpasses the previous
state-of-the-art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute
points respectively, and boosts the success rate by 26.6 points (+161%) over
similar plain web agents with its observation and action space alignment. We
achieve this without using in-context examples, new agent roles, online
feedback or search strategies. AgentOccam's simple design highlights LLMs'
impressive zero-shot performance on web tasks, and underlines the critical role
of carefully tuning observation and action spaces for LLM-based agents.
☆ Harnessing Webpage UIs for Text-Rich Visual Understanding
Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue
Text-rich visual understanding, the ability to process environments where
dense textual content is integrated with visuals, is crucial for multimodal
large language models (MLLMs) to interact effectively with structured
environments. To enhance this capability, we propose synthesizing general
multimodal instructions from webpage UIs using text-based large language models
(LLMs). Despite lacking direct visual input, text-based LLMs are able to
process structured text representations from webpage accessibility trees. These
instructions are then paired with UI screenshots to train multimodal models. We
introduce MultiUI, a dataset containing 7.3 million samples from 1 million
websites, covering diverse multimodal tasks and UI layouts. Models trained on
MultiUI not only excel in web UI tasks, achieving up to a 48% improvement on
VisualWebBench and a 19.1% boost in action accuracy on the web agent dataset
Mind2Web, but also generalize surprisingly well to non-web UI tasks and even to
non-UI domains, such as document understanding, OCR, and chart interpretation.
These results highlight the broad applicability of web UI data for advancing
text-rich visual understanding across various scenarios.
☆ De-mark: Watermark Removal in Large Language Models
Watermarking techniques offer a promising way to identify machine-generated
content by embedding covert information into the content generated by
language models (LMs). However, the robustness of these watermarking schemes
has not been well explored. In this paper, we present De-mark, an advanced
framework designed to remove n-gram-based watermarks effectively. Our method
utilizes a novel querying strategy, termed random selection probing, which aids
in assessing the strength of the watermark and identifying the red-green list
within the n-gram watermark. Experiments on popular LMs, such as Llama3 and
ChatGPT, demonstrate the efficiency and effectiveness of De-mark in watermark
removal and exploitation tasks.
☆ A Watermark for Order-Agnostic Language Models
Statistical watermarking techniques are well-established for sequentially
decoded language models (LMs). However, these techniques cannot be directly
applied to order-agnostic LMs, as the tokens in order-agnostic LMs are not
generated sequentially. In this work, we introduce Pattern-mark, a
pattern-based watermarking framework specifically designed for order-agnostic
LMs. We develop a Markov-chain-based watermark generator that produces
watermark key sequences with high-frequency key patterns. Correspondingly, we
propose a statistical pattern-based detection algorithm that recovers the key
sequence during detection and conducts statistical tests based on the count of
high-frequency patterns. Our extensive evaluations on order-agnostic LMs, such
as ProteinMPNN and CMLM, demonstrate Pattern-mark's enhanced detection
efficiency, generation quality, and robustness, positioning it as a superior
watermarking technique for order-agnostic LMs.
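To make the mechanism concrete, here is a toy sketch of a Markov-chain key generator biased toward one high-frequency pattern, with a count statistic for detection; the two-symbol chain, transition probabilities, and pattern are all illustrative assumptions, not Pattern-mark's actual construction.

```python
import numpy as np

rng = np.random.default_rng(0)
# Transitions biased so the key pattern (0, 1) recurs far more often
# than under uniform i.i.d. keys.
P = np.array([[0.1, 0.9],
              [0.8, 0.2]])

def generate_keys(n):
    keys = [0]
    for _ in range(n - 1):
        keys.append(int(rng.choice(2, p=P[keys[-1]])))
    return keys

def count_pattern(keys, pattern=(0, 1)):
    k = len(pattern)
    return sum(tuple(keys[i:i + k]) == pattern for i in range(len(keys) - k + 1))

keys = generate_keys(1000)
# Detection compares the observed count to its expectation under
# unwatermarked uniform keys (here (n-1)/4), e.g., via a one-sided test.
print(count_pattern(keys), "observed vs ~", (len(keys) - 1) / 4, "expected if unwatermarked")
```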
☆ BenTo: Benchmark Task Reduction with In-Context Transferability
Evaluating large language models (LLMs) is costly: it requires the generation
and examination of LLM outputs on a large-scale benchmark of various tasks.
This paper investigates how to efficiently reduce the tasks used to benchmark
LLMs without affecting the evaluation quality. Our study reveals that task
transferability and relevance provide critical information to identify the most
representative subset of tasks via optimizing a facility location function. We
propose a practically efficient metric for estimating the transferability
between two tasks via in-context learning (ICL). By analyzing the pairwise
transferability, we can reduce tasks in a modern LLM benchmark (e.g., MMLU or
FLAN) to 5% of their original size while inducing only a <4% difference in
evaluation results relative to the original benchmark. Compared to prior works,
our method is training-free, gradient-free, and highly efficient, requiring
only ICL.
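The subset selection step can be illustrated with the standard greedy algorithm for facility location maximization; the transferability matrix below is a random placeholder, and the exact objective weighting in BenTo may differ.

```python
import numpy as np

def select_tasks(transfer, k):
    """Greedily maximize F(S) = sum_i max_{j in S} transfer[i, j],
    where transfer[i, j] estimates (e.g., via ICL) how well task j
    transfers to task i. Assumes non-negative transferability scores."""
    n = transfer.shape[0]
    selected, coverage = [], np.zeros(n)
    for _ in range(k):
        gains = np.maximum(transfer, coverage[:, None]).sum(axis=0) - coverage.sum()
        gains[selected] = -np.inf          # don't pick a task twice
        j = int(np.argmax(gains))
        selected.append(j)
        coverage = np.maximum(coverage, transfer[:, j])
    return selected

# Toy usage: pick 3 representative tasks out of 57 (an MMLU-sized placeholder).
subset = select_tasks(np.random.default_rng(0).random((57, 57)), k=3)
```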
☆ Modeling Future Conversation Turns to Teach LLMs to Ask Clarifying Questions
Large language models (LLMs) must often respond to highly ambiguous user
requests. In such cases, the LLM's best response may be to ask a clarifying
question to elicit more information. We observe existing LLMs often respond by
presupposing a single interpretation of such ambiguous requests, frustrating
users who intended a different interpretation. We speculate this is caused by
current preference data labeling practice, where LLM responses are evaluated
only on their prior contexts. To address this, we propose to assign preference
labels by simulating the expected outcomes of responses in future turns. This
allows LLMs to learn to ask clarifying questions when doing so lets them
generate responses tailored to each user interpretation in future turns. In
experiments on open-domain QA, we compare systems trained using our proposed
preference labeling method against standard methods, which assign preferences
based only on prior context. We evaluate systems on their ability to ask
clarifying questions that recover each user's interpretation and expected
answer, and find that training with our proposed method yields a 5%
improvement in F1 measured against the answer set from different
interpretations of each query.
☆ Looking Inward: Language Models Can Learn About Themselves by Introspection
Felix J Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, Owain Evans
Humans acquire knowledge by observing the external world, but also by
introspection. Introspection gives a person privileged access to their current
state of mind (e.g., thoughts and feelings) that is not accessible to external
observers. Can LLMs introspect? We define introspection as acquiring knowledge
that is not contained in or derived from training data but instead originates
from internal states. Such a capability could enhance model interpretability.
Instead of painstakingly analyzing a model's internal workings, we could simply
ask the model about its beliefs, world models, and goals. More speculatively,
an introspective model might self-report on whether it possesses certain
internal states, such as subjective feelings or desires, and this could inform us
about the moral status of these states. Such self-reports would not be entirely
dictated by the model's training data.
We study introspection by finetuning LLMs to predict properties of their own
behavior in hypothetical scenarios. For example, "Given the input P, would your
output favor the short- or long-term option?" If a model M1 can introspect, it
should outperform a different model M2 in predicting M1's behavior even if M2
is trained on M1's ground-truth behavior. The idea is that M1 has privileged
access to its own behavioral tendencies, and this enables it to predict itself
better than M2 (even if M2 is generally stronger).
In experiments with GPT-4, GPT-4o, and Llama-3 models (each finetuned to
predict itself), we find that the model M1 outperforms M2 in predicting itself,
providing evidence for introspection. Notably, M1 continues to predict its
behavior accurately even after we intentionally modify its ground-truth
behavior. However, while we successfully elicit introspection on simple tasks,
we are unsuccessful on more complex tasks or those requiring
out-of-distribution generalization.
comment: 15 pages, 9 figures
☆ PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment
Alignment of large language models (LLMs) involves training models on
preference-contrastive output pairs to adjust their responses according to
human preferences. To obtain such contrastive pairs, traditional methods like
RLHF and RLAIF rely on limited contrasting patterns, such as varying model
variants or decoding temperatures. This narrowness leads to two issues: (1)
alignment is not comprehensive, and thereby (2) models are susceptible to
jailbreaking attacks. To address these issues, we investigate how to construct
more comprehensive and diversified contrasting patterns to enhance preference
data (RQ1) and verify the impact of the diversification of contrasting patterns
on model alignment (RQ2). For RQ1, we propose PopAlign, a framework that
integrates diversified contrasting patterns across the prompt, model, and
pipeline levels, introducing six contrasting strategies that do not require
additional feedback labeling procedures. Regarding RQ2, we conduct thorough
experiments demonstrating that PopAlign significantly outperforms existing
methods, leading to more comprehensive alignment.
comment: 28 pages
☆ Quantity vs. Quality of Monolingual Source Data in Automatic Text Translation: Can It Be Too Little If It Is Too Good?
Monolingual data, which is readily available in large quantities, has been
used to augment the scarcely available parallel data for training better
automatic translation models. Self-learning, where a model learns from its own
output, is one approach to exploiting such data. However, it has been shown
that too much of this data can be detrimental to model performance when the
available parallel data is comparatively very limited. In this study, we
investigate whether monolingual data can also be too little and whether this
reduction, based on quality, has any effect on the performance of the
translation model. Experiments on English-German low-resource NMT show that it
is often better to select only the most useful additional data, based on
quality or closeness to the domain of the test data, than to utilize all of
the available data.
☆ Optimal Quantization for Matrix Multiplication
Recent work in the machine learning community has proposed multiple methods
for performing lossy compression (quantization) of large matrices. This
quantization is important for accelerating matrix multiplication (a main
component of large language models), which is often bottlenecked by the speed
of loading these matrices from memory. Unlike classical vector quantization
and rate-distortion theory, the goal of these new compression algorithms is to
approximate not the matrices themselves, but their matrix product.
Specifically, given a pair of real matrices $A,B$, an encoder (compressor) is
applied to each of them independently, producing descriptions with $R$ bits
per entry. These representations are subsequently used by the decoder to
estimate the matrix product $A^\top B$. In this work, we provide a non-asymptotic lower
bound on the mean squared error of this approximation (as a function of rate
$R$) for the case of matrices $A,B$ with iid Gaussian entries. Algorithmically,
we construct a universal quantizer based on nested lattices with an explicit
guarantee of approximation error for any (non-random) pair of matrices $A$, $B$
in terms of only Frobenius norms $\|A\|_F, \|B\|_F$ and $\|A^\top B\|_F$. For
iid Gaussian matrices our quantizer achieves the lower bound and is, thus,
asymptotically optimal. A practical low-complexity version of our quantizer
achieves performance quite close to optimal. In information-theoretic terms,
we derive the rate-distortion function for matrix multiplication of iid
Gaussian matrices.
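In symbols, the setting described above (restated from the abstract; the notation is ours) reads

$$\hat{C} = \mathrm{dec}\big(\mathrm{enc}_A(A),\, \mathrm{enc}_B(B)\big), \qquad D(R) = \mathbb{E}\,\big\|A^\top B - \hat{C}\big\|_F^2,$$

with each encoder limited to $R$ bits per matrix entry, and the lattice quantizer's deterministic guarantee depending only on $\|A\|_F$, $\|B\|_F$, and $\|A^\top B\|_F$.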
☆ The Mystery of the Pathological Path-star Task for Language Models EMNLP 2024
The recently introduced path-star task is a minimal task designed to
exemplify limitations in the abilities of language models (Bachmann and
Nagarajan, 2024). It involves a path-star graph where multiple arms radiate
from a single starting node and each node is unique. Given the start node and a
specified target node that ends an arm, the task is to generate the arm
containing that target node. This is straightforward for a human but
surprisingly difficult for language models, which did not outperform the random
baseline. The authors hypothesized this is due to a deficiency in
teacher-forcing and the next-token prediction paradigm.
We demonstrate the task is learnable using teacher-forcing in alternative
settings and that the issue is partially due to representation. We introduce a
regularization method using structured samples of the same graph but with
differing target nodes, improving results across a variety of model types. We
provide RASP proofs showing the task is theoretically solvable. Finally, we
find settings where an encoder-only model can consistently solve the task.
comment: EMNLP 2024 Main
☆ Aggregation Artifacts in Subjective Tasks Collapse Large Language Models' Posteriors
In-context Learning (ICL) has become the primary method for performing
natural language tasks with Large Language Models (LLMs). The knowledge
acquired during pre-training is crucial for this few-shot capability, providing
the model with task priors. However, recent studies have shown that ICL
predominantly relies on retrieving task priors rather than "learning" to
perform tasks. This limitation is particularly evident in complex subjective
domains such as emotion and morality, where priors significantly influence
posterior predictions. In this work, we examine whether this is the result of
the aggregation used in corresponding datasets, where trying to combine
low-agreement, disparate annotations might lead to annotation artifacts that
create detrimental noise in the prompt. Moreover, we evaluate the posterior
bias towards certain annotators by grounding our study in appropriate,
quantitative measures of LLM priors. Our results indicate that aggregation is a
confounding factor in the modeling of subjective tasks, and advocate focusing
on modeling individuals instead. However, aggregation does not explain the
entire gap between ICL and the state of the art, meaning other factors in such
tasks also account for the observed phenomena. Finally, by rigorously studying
annotator-level labels, we find that it is possible for minority annotators to
both better align with LLMs and have their perspectives further amplified.
comment: 12 pages, 7 figures, 2 tables
☆ Knowledge-Aware Query Expansion with Large Language Models for Textual and Relational Retrieval
Large language models (LLMs) have been used to generate query expansions
augmenting original queries for improving information search. Recent studies
also explore providing LLMs with initial retrieval results to generate query
expansions more grounded to document corpus. However, these methods mostly
focus on enhancing textual similarities between search queries and target
documents, overlooking document relations. For queries like "Find me a highly
rated camera for wildlife photography compatible with my Nikon F-Mount lenses",
existing methods may generate expansions that are semantically similar but
structurally unrelated to user intents. To handle such semi-structured queries
with both textual and relational requirements, in this paper we propose a
knowledge-aware query expansion framework, augmenting LLMs with structured
document relations from a knowledge graph (KG). To further address the limitation
of entity-based scoring in existing KG-based methods, we leverage document
texts as rich KG node representations and use document-based relation filtering
for our Knowledge-Aware Retrieval (KAR). Extensive experiments on three
datasets of diverse domains show the advantages of our method compared against
state-of-the-art baselines on textual and relational semi-structured retrieval.
☆ MobA: A Two-Level Agent System for Efficient Mobile Task Automation
Zichen Zhu, Hao Tang, Yansi Li, Kunyao Lan, Yixuan Jiang, Hao Zhou, Yixiao Wang, Situo Zhang, Liangtai Sun, Lu Chen, Kai Yu
Current mobile assistants are limited by dependence on system APIs or
struggle with complex user instructions and diverse interfaces due to
restricted comprehension and decision-making abilities. To address these
challenges, we propose MobA, a novel Mobile phone Agent powered by multimodal
large language models that enhances comprehension and planning capabilities
through a sophisticated two-level agent architecture. The high-level Global
Agent (GA) is responsible for understanding user commands, tracking history
memories, and planning tasks. The low-level Local Agent (LA) predicts detailed
actions in the form of function calls, guided by sub-tasks and memory from the
GA. Integrating a Reflection Module allows for efficient task completion and
enables the system to handle previously unseen complex tasks. MobA demonstrates
significant improvements in task execution efficiency and completion rate in
real-life evaluations, underscoring the potential of MLLM-empowered mobile
assistants.
comment: 27 pages, 6 figures, and 5 tables. We will release our source code in
a few days
☆ LLM-Human Pipeline for Cultural Context Grounding of Conversations
Conversations often adhere to well-understood social norms that vary across
cultures. For example, while "addressing parents by name" is commonplace in the
West, it is rare in most Asian cultures. Adherence or violation of such norms
often dictates the tenor of conversations. Humans are able to navigate social
situations requiring cultural awareness quite adeptly. However, it is a hard
task for NLP models.
In this paper, we tackle this problem by introducing a "Cultural Context
Schema" for conversations. It comprises (1) conversational information such as
emotions, dialogue acts, etc., and (2) cultural information such as social
norms, violations, etc. We generate ~110k social norm and violation
descriptions for ~23k conversations from Chinese culture using LLMs. We refine
them using automated verification strategies which are evaluated against
culturally aware human judgements. We organize these descriptions into
meaningful structures we call "Norm Concepts", using an interactive
human-in-the-loop framework. We ground the norm concepts and the descriptions in
conversations using symbolic annotation. Finally, we use the obtained dataset
for downstream tasks such as emotion, sentiment, and dialogue act detection. We
show that it significantly improves the empirical performance.
comment: 19 pages, 9 figures, 7 tables
☆ MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems
Traditional Retrieval-Augmented Generation (RAG) benchmarks rely on different
heuristic-based metrics for evaluation, but these require human preferences as
ground truth for reference. In contrast, arena-based benchmarks, where two
models compete against each other, require an expensive Large Language Model
(LLM) as a judge for reliable evaluation. We present an easy and efficient
technique to get the best of both worlds. The idea is to train a
learning-to-rank model as a "surrogate" judge, using RAG-based evaluation
heuristics as input, to produce a synthetic arena-based leaderboard. Using
this idea, we develop MIRAGE-Bench, a standardized arena-based multilingual
RAG benchmark for 18 diverse languages on Wikipedia. The benchmark is
constructed using MIRACL, a retrieval dataset, and extended for multilingual
generation evaluation. MIRAGE-Bench evaluates RAG extensively, coupling
heuristic features with an LLM-as-a-judge evaluator. In our work, we benchmark
19 diverse multilingual-focused LLMs and achieve a high correlation (Kendall's
Tau = 0.909) between our surrogate judge, trained on heuristic features with
pairwise evaluations, and GPT-4o as a teacher on the MIRAGE-Bench leaderboard
built with the Bradley-Terry framework. We
observe proprietary and large open-source LLMs currently dominate in
multilingual RAG. MIRAGE-Bench is available at:
https://github.com/vectara/mirage-bench.
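For reference, leaderboard scores under the Bradley-Terry framework can be fit from pairwise outcomes with the standard MM updates (Hunter, 2004); the win matrix below is a toy placeholder, not MIRAGE-Bench data.

```python
import numpy as np

def bradley_terry(wins, iters=200):
    """wins[i, j] = times model i beat model j (diagonal zero).
    Returns normalized Bradley-Terry strengths via MM updates."""
    n = wins.shape[0]
    s = np.ones(n)
    games = wins + wins.T
    for _ in range(iters):
        denom = (games / (s[:, None] + s[None, :])).sum(axis=1)
        s = wins.sum(axis=1) / denom
        s /= s.sum()   # fix the overall scale
    return s

wins = np.array([[0, 8, 9],
                 [2, 0, 6],
                 [1, 4, 0]])
print(bradley_terry(wins))   # strengths ranking the three toy models
```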
☆ On the Role of Attention Heads in Large Language Model Safety
Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Junfeng Fang, Yongbin Li
Large language models (LLMs) achieve state-of-the-art performance on multiple
language tasks, yet their safety guardrails can be circumvented, leading to
harmful generations. In light of this, recent research on safety mechanisms has
emerged, revealing that when safety representations or components are
suppressed, the safety capabilities of LLMs are compromised. However, existing
research tends to overlook the safety impact of multi-head attention
mechanisms, despite their crucial role in various model functionalities. Hence,
in this paper, we aim to explore the connection between standard attention
mechanisms and safety capability to fill this gap in safety-related
mechanistic interpretability. We propose a novel metric tailored for
multi-head attention, the Safety Head ImPortant Score (Ships), to assess
individual heads' contributions to model safety. Based on this, we generalize
Ships to the dataset level and further introduce the Safety Attention Head
AttRibution Algorithm (Sahara) to attribute the critical safety attention heads
inside the model. Our findings show that specific attention heads have a
significant impact on safety. Ablating a single safety head allows an aligned
model (e.g., Llama-2-7b-chat) to respond to 16 times more harmful queries,
while only modifying 0.006% of the parameters, in contrast to the ~5%
modification required in previous studies. More importantly, through
comprehensive experiments we demonstrate that attention heads primarily
function as feature extractors for safety and that models fine-tuned from the
same base model exhibit overlapping safety heads. Together, our attribution
approach and
findings provide a novel perspective for unpacking the black box of safety
mechanisms within large models.
comment: 28 pages, 18 figures, 7 tables
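Ablating a single attention head, the intervention underlying these results, can be sketched as zeroing that head's slice of the concatenated attention output before the output projection; the function below is a generic illustration, not the paper's Ships or Sahara code.

```python
import torch

def ablate_head(attn_output: torch.Tensor, head: int, head_dim: int) -> torch.Tensor:
    """attn_output: [batch, seq, n_heads * head_dim], the concatenated
    per-head outputs before the output projection. Zeros one head."""
    out = attn_output.clone()
    out[..., head * head_dim:(head + 1) * head_dim] = 0.0
    return out

# A head-importance score can then compare the model's refusal behavior on
# harmful prompts with and without this ablation.
```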
☆ Unconstrained Model Merging for Enhanced LLM Reasoning
Yiming Zhang, Baoyi He, Shengyu Zhang, Yuhao Fu, Qi Zhou, Zhijie Sang, Zijin Hong, Kejing Yang, Wenjun Wang, Jianbo Yuan, Guangning Han, Linyi Li, Chunlin Ji, Fei Wu, Hongxia Yang
Recent advancements in building domain-specific large language models (LLMs)
have shown remarkable success, especially in tasks requiring reasoning
abilities like logical inference over complex relationships and multi-step
problem solving. However, creating a powerful all-in-one LLM remains
challenging due to the need for proprietary data and vast computational
resources. As a resource-friendly alternative, we explore the potential of
merging multiple expert models into a single LLM. Existing studies on model
merging mainly focus on generalist LLMs rather than domain experts, or on LLMs
of the same architecture and size. In this work, we propose an unconstrained
model merging framework that accommodates both homogeneous and heterogeneous
model architectures, with a focus on reasoning tasks. A fine-grained layer-wise
weight merging strategy is designed for homogeneous model merging, while
heterogeneous model merging is built upon probabilistic distribution
knowledge derived from instruction-response fine-tuning data. Across 7
benchmarks and 9 reasoning-optimized LLMs, we reveal a key finding:
combinatorial reasoning emerges from merging, surpassing simple additive
effects. We propose that unconstrained model merging could serve as a
foundation for decentralized LLMs, marking a notable progression from the
existing centralized LLM framework. This evolution could enhance wider
participation and stimulate additional advancement in the field of artificial
intelligence, effectively addressing the constraints posed by centralized
models.
comment: Under review
☆ Exploring the Design Space of Visual Context Representation in Video MLLMs
Yifan Du, Yuqi Huo, Kun Zhou, Zijia Zhao, Haoyu Lu, Han Huang, Wayne Xin Zhao, Bingning Wang, Weipeng Chen, Ji-Rong Wen
Video Multimodal Large Language Models (MLLMs) have shown remarkable
capability in understanding video semantics across various downstream tasks.
Despite the advancements, there is still a lack of systematic research on
visual context representation, which refers to the scheme to select frames from
a video and further select the tokens from a frame. In this paper, we explore
the design space for visual context representation, and aim to improve the
performance of video MLLMs by finding more effective representation schemes.
Firstly, we formulate the task of visual context representation as a
constrained optimization problem, and model the language modeling loss as a
function of the number of frames and the number of embeddings (or tokens) per
frame, given the maximum visual context window size. Then, we explore the
scaling effects in frame selection and token selection respectively, and fit
the corresponding function curve by conducting extensive empirical experiments.
We examine the effectiveness of typical selection strategies and present
empirical findings to determine the two factors. Furthermore, we study the
joint effect of frame selection and token selection, and derive the optimal
formula for determining the two factors. We demonstrate that the derived
optimal settings align with the best-performing results of the empirical
experiments. Our code and model are available at:
https://github.com/RUCAIBox/Opt-Visor.
comment: Long Video MLLM; work in progress
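Restated in symbols (our notation, inferred from the abstract): the paper studies

$$\min_{T,\,M}\; \mathcal{L}(T, M) \quad \text{s.t.} \quad T \cdot M \le C,$$

where $T$ is the number of selected frames, $M$ the number of embeddings (tokens) kept per frame, $C$ the maximum visual context window size, and $\mathcal{L}$ the language modeling loss fitted empirically as a function of each factor.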
☆ Pose-Based Sign Language Appearance Transfer
We introduce a method for transferring the signer's appearance in sign
language skeletal poses while preserving the sign content. Using estimated
poses, we transfer the appearance of one signer to another, maintaining natural
movements and transitions. This approach improves pose-based rendering and sign
stitching while obfuscating identity. Our experiments show that while the
method reduces signer identification accuracy, it slightly harms sign
recognition performance, highlighting a tradeoff between privacy and utility.
Our code is available at
https://github.com/sign-language-processing/pose-anonymization.
☆ HEALTH-PARIKSHA: Assessing RAG Models for Health Chatbots in Real-World Multilingual Settings
Assessing the capabilities and limitations of large language models (LLMs)
has garnered significant interest, yet the evaluation of multiple models in
real-world scenarios remains rare. Multilingual evaluation often relies on
translated benchmarks, which typically do not capture linguistic and cultural
nuances present in the source language. This study provides an extensive
assessment of 24 LLMs on real-world data collected from Indian patients
interacting with a medical chatbot in Indian English and 4 other Indic
languages. We employ a uniform Retrieval Augmented Generation framework to
generate responses, which are evaluated using both automated techniques and
human evaluators on four specific metrics relevant to our application. We find
that models vary significantly in their performance and that instruction-tuned
Indic models do not always perform well on Indic language queries. Further, we
empirically show that factual correctness is generally lower for responses to
Indic queries compared to English queries. Finally, our qualitative work shows
that code-mixed and culturally relevant queries in our dataset pose challenges
to evaluated models.
comment: Under Review
☆ signwriting-evaluation: Effective Sign Language Evaluation via SignWriting
The lack of automatic evaluation metrics tailored for SignWriting presents a
significant obstacle in developing effective transcription and translation
models for signed languages. This paper introduces a comprehensive suite of
evaluation metrics specifically designed for SignWriting, including adaptations
of standard metrics such as BLEU and chrF, the application of
CLIPScore to SignWriting images, and a novel symbol distance metric
unique to our approach. We address the distinct challenges of evaluating single
signs versus continuous signing and provide qualitative demonstrations of
metric efficacy through score distribution analyses and nearest-neighbor
searches within the SignBank corpus. Our findings reveal the strengths and
limitations of each metric, offering valuable insights for future advancements
using SignWriting. This work contributes essential tools for evaluating
SignWriting models, facilitating progress in the field of sign language
processing. Our code is available at
https://github.com/sign-language-processing/signwriting-evaluation.
☆ ORCHID: A Chinese Debate Corpus for Target-Independent Stance Detection and Argumentative Dialogue Summarization EMNLP 2023
Dialogue agents have been receiving increasing attention for years, and this
trend has been further boosted by the recent progress of large language models
(LLMs). Stance detection and dialogue summarization are two core tasks of
dialogue agents in application scenarios that involve argumentative dialogues.
However, research on these tasks is limited by the insufficiency of public
datasets, especially for non-English languages. To address this language
resource gap in Chinese, we present ORCHID (Oral Chinese Debate), the first
Chinese dataset for benchmarking target-independent stance detection and debate
summarization. Our dataset consists of 1,218 real-world debates that were
conducted in Chinese on 476 unique topics, containing 2,436 stance-specific
summaries and 14,133 fully annotated utterances. Besides providing a versatile
testbed for future research, we also conduct an empirical study on the dataset
and propose an integrated task. The results show the challenging nature of the
dataset and suggest the potential of incorporating stance detection into
summarization for argumentative dialogue.
comment: In EMNLP 2023
☆ VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks
Shailaja Keyur Sampat, Mutsumi Nakamura, Shankar Kailas, Kartik Aggarwal, Mandy Zhou, Yezhou Yang, Chitta Baral
Deriving inference from heterogeneous inputs (such as images, text, and
audio) is an important skill for humans to perform day-to-day tasks. A similar
ability is desirable for the development of advanced Artificial Intelligence
(AI) systems. While state-of-the-art models are rapidly closing the gap with
human-level performance on diverse computer vision and NLP tasks separately,
they struggle to solve tasks that require joint reasoning over visual and
textual modalities. Inspired by GLUE (Wang et al., 2018), a multitask
benchmark for natural language understanding, we propose VL-GLUE in this paper.
VL-GLUE consists of over 100k samples spanning seven different tasks, which at
their core require visuo-linguistic reasoning. Moreover, our benchmark
comprises diverse image types (from synthetically rendered figures and
day-to-day scenes to charts and complex diagrams) and includes a broad variety
of domain-specific text (from cooking, politics, and sports to high-school
curricula), demonstrating the need for multi-modal understanding in the real
world. We show that this benchmark is quite challenging for existing
large-scale vision-language models and encourage the development of systems that
possess robust visuo-linguistic reasoning capabilities.
comment: 18 pages, 7 figures
☆ Red and blue language: Word choices in the Trump & Harris 2024 presidential debate
Political debates are a peculiar type of political discourse, in which
candidates directly confront one another, addressing not only the
moderator's questions, but also their opponent's statements, as well as the
concerns of voters from both parties and undecided voters. Therefore, language
is adjusted to meet specific expectations and achieve persuasion. We analyse
how the language of Trump and Harris during the debate (September 10th, 2024)
differs in relation to the following semantic and pragmatic features, for which
we formulated targeted hypotheses: framing values and ideology, appealing to
emotion, using words with different degrees of concreteness and specificity,
addressing others through singular or plural pronouns. Our findings include:
differences in the use of figurative frames (Harris often framing issues around
recovery and empowerment, Trump often focused on crisis and decline); similar
use of emotional language, with Trump showing a slightly higher tendency toward
negativity and toward less subjective language compared to Harris; no
significant difference in the specificity of candidates' responses; similar use
of abstract language, with Trump showing more variability than Harris
depending on the subject discussed; differences in addressing the opponent,
with Trump never mentioning Harris by name while Harris refers to Trump
frequently; and different uses of pronouns, with Harris using singular and
plural pronouns equally while Trump uses more singular pronouns. The results
are discussed in relation to previous literature on Red and Blue language,
which refers to distinct linguistic patterns associated with conservative (Red)
and liberal (Blue) political ideologies.
comment: Submitted to PLOS ONE, under review
☆ A new approach for fine-tuning sentence transformers for intent classification and out-of-scope detection tasks
In virtual assistant (VA) systems it is important to reject or redirect user
queries that fall outside the scope of the system. One of the most accurate
approaches for out-of-scope (OOS) rejection is to combine it with the task of
intent classification on in-scope queries, and to use methods based on the
similarity of embeddings produced by transformer-based sentence encoders.
Typically, such encoders are fine-tuned for the intent-classification task,
using cross-entropy loss. Recent work has shown that while this produces
suitable embeddings for the intent-classification task, it also tends to
disperse in-scope embeddings over the full sentence embedding space. This
causes the in-scope embeddings to potentially overlap with OOS embeddings,
thereby making OOS rejection difficult. This is compounded when OOS data is
unknown. To mitigate this issue our work proposes to regularize the
cross-entropy loss with an in-scope embedding reconstruction loss learned using
an auto-encoder. Our method achieves a 1-4% improvement in the area under the
precision-recall curve for rejecting out-of-scope (OOS) instances, without
compromising intent classification performance.
comment: Appearing at Empirical Methods in Natural Language Processing 2025 -
Industry Track
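A minimal sketch of the proposed objective, cross-entropy regularized by an in-scope embedding reconstruction loss from an auto-encoder; the dimensions, architecture, and weighting `lam` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntentHead(nn.Module):
    def __init__(self, dim=384, n_intents=20, bottleneck=64):
        super().__init__()
        self.classifier = nn.Linear(dim, n_intents)
        self.autoencoder = nn.Sequential(      # reconstructs in-scope embeddings
            nn.Linear(dim, bottleneck), nn.ReLU(), nn.Linear(bottleneck, dim))

    def loss(self, emb, labels, lam=0.1):
        ce = F.cross_entropy(self.classifier(emb), labels)
        recon = F.mse_loss(self.autoencoder(emb), emb)
        return ce + lam * recon                # lam trades off the two terms

# At inference, a high reconstruction error on a query embedding could also
# serve as a signal that the query is out of scope.
```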
☆ SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs
While prior work has explored whether large language models (LLMs) possess a
"theory of mind" (ToM) - the ability to attribute mental states to oneself and
others - there has been little work testing whether LLMs can implicitly apply
such knowledge to predict behavior, or to judge whether an observed behavior is
rational. Such skills are critical for appropriate interaction in social
environments. We create a new dataset, SimpleToM, containing concise, diverse
stories (e.g., "The can of Pringles has moldy chips in it. Mary picks up the
can in the supermarket and walks to the cashier."), each with three questions
that test different degrees of ToM reasoning, asking models to predict (a)
mental state ("Is Mary aware of the mold?"), (b) behavior ("Will Mary pay for
the chips or report the mold?"), and (c) judgment ("Mary paid for the chips.
Was that reasonable?"). To our knowledge, SimpleToM is the first dataset to
systematically explore downstream reasoning requiring knowledge of mental
states in realistic scenarios. Our experimental results are intriguing: While
most models can reliably predict mental state on our dataset (a), they often
fail to correctly predict the behavior (b), and fare even worse at judging
whether given behaviors are reasonable (c), despite correct awareness of the
protagonist's mental state, which should make such secondary predictions obvious.
We further show that we can help models do better at (b) and (c) via
interventions such as reminding the model of its earlier mental state answer
and mental-state-specific chain-of-thought prompting, raising the action
prediction accuracies (e.g., from 49.5% to 93.5% for GPT-4o) and judgment
accuracies (e.g., from 15.3% to 94.7% in GPT-4o). While this shows that models
can be coaxed to perform well, it requires task-specific interventions, and the
natural model performances remain low, a cautionary tale for LLM deployment.
☆ An Active Learning Framework for Inclusive Generation by Large Language Models
Ensuring that Large Language Models (LLMs) generate text representative of
diverse sub-populations is essential, particularly when key concepts related to
under-represented groups are scarce in the training data. We address this
challenge with a novel clustering-based active learning framework, enhanced
with knowledge distillation. The proposed framework transforms the intermediate
outputs of the learner model, enabling effective active learning for generative
tasks for the first time. Integration of clustering and knowledge distillation
yields more representative models without prior knowledge of the underlying
data distribution or burdensome human effort. We validate our approach in
practice through case studies in counter-narration and style transfer. We
construct two new datasets in tandem with model training, showing a performance
improvement of 2%-10% over baseline models. Our results also show more
consistent performance across various data subgroups and increased lexical
diversity, underscoring our model's resilience to skewness in available data.
Further, our results show that the data acquired via our approach improves the
performance of secondary models not involved in the learning loop, showcasing
practical utility of the framework.
☆ Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation
LLM self-evaluation relies on the LLM's own ability to estimate response
correctness, which can greatly improve its deployment reliability. In this
work, we propose the Chain-of-Embedding (CoE) in the latent space to
enable LLMs to perform output-free self-evaluation. CoE consists of all
progressive hidden states produced during inference, which can be
treated as the latent thinking path of LLMs. We find that the CoE features of
LLMs differ between correct and incorrect responses, and these discrepancies
help us estimate LLM response correctness. Experiments across four diverse
domains and seven LLMs fully demonstrate the effectiveness of our method.
Meanwhile, its label-free design, which requires no training, and its
millisecond-level computational cost ensure real-time feedback in large-scale
scenarios. More importantly, we provide interesting insights into LLM response
correctness from the perspective of hidden state changes inside LLMs.
comment: 33 pages, 18 figures, 12 tables
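As a rough illustration, CoE-style features can be derived from the per-layer hidden-state trajectory of the final token; the step magnitudes and angles below are an illustrative proxy, not the paper's exact feature set.

```python
import torch
import torch.nn.functional as F

def coe_features(hidden_states):
    """hidden_states: tuple of [batch, seq, dim] tensors, one per layer
    (e.g., a HuggingFace model called with output_hidden_states=True).
    Returns simple trajectory features of the final token."""
    traj = torch.stack([h[:, -1, :] for h in hidden_states])  # [layers, batch, dim]
    steps = traj[1:] - traj[:-1]                              # layer-to-layer movement
    magnitudes = steps.norm(dim=-1)                           # how far each step moves
    angles = F.cosine_similarity(steps[1:], steps[:-1], dim=-1)  # how much it turns
    return magnitudes, angles  # features that may differ for right vs. wrong answers
```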
★ A Comparative Study on Reasoning Patterns of OpenAI's o1 Model
Siwei Wu, Zhongyuan Peng, Xinrun Du, Tuney Zheng, Minghao Liu, Jialong Wu, Jiachen Ma, Yizhi Li, Jian Yang, Wangchunshu Zhou, Qunshu Lin, Junbo Zhao, Zhaoxiang Zhang, Wenhao Huang, Ge Zhang, Chenghua Lin, J. H. Liu
Enabling Large Language Models (LLMs) to handle a wider range of complex
tasks (e.g., coding, math) has drawn great attention from many researchers. As
LLMs continue to evolve, merely increasing the number of model parameters
yields diminishing performance improvements and heavy computational costs.
Recently, OpenAI's o1 model has shown that inference strategies (i.e.,
Test-time Compute methods) can also significantly enhance the reasoning
capabilities of LLMs. However, the mechanisms behind these methods are still
unexplored. In our work, to investigate the reasoning patterns of o1, we
compare o1 with existing Test-time Compute methods (BoN, Step-wise BoN, Agent
Workflow, and Self-Refine) by using OpenAI's GPT-4o as a backbone on general
reasoning benchmarks in three domains (i.e., math, coding, commonsense
reasoning). First, our experiments show that the o1 model achieves the best
performance on most datasets. Second, for methods that search over diverse
responses (e.g., BoN), we find that the reward models' capability and the
search space both limit the upper bound of these methods. Third, for methods
that break the problem into many sub-problems, Agent Workflow achieves better
performance than Step-wise BoN thanks to its domain-specific system prompts
for planning better reasoning processes. Finally, we summarize six reasoning
patterns of o1 and provide a detailed analysis on several reasoning benchmarks.
☆ H2OVL-Mississippi Vision Language Models Technical Report
Shaikat Galib, Shanshan Wang, Guanshuo Xu, Pascal Pfeiffer, Ryan Chesler, Mark Landry, Sri Satish Ambati
Smaller vision-language models (VLMs) are becoming increasingly important for
privacy-focused, on-device applications due to their ability to run efficiently
on consumer hardware for processing enterprise commercial documents and images.
These models require strong language understanding and visual capabilities to
enhance human-machine interaction. To address this need, we present
H2OVL-Mississippi, a pair of small VLMs trained on 37 million image-text pairs
using 240 hours of compute on 8×H100 GPUs. H2OVL-Mississippi-0.8B is a tiny
model with 0.8 billion parameters that specializes in text recognition,
achieving state-of-the-art performance on the Text Recognition portion of
OCRBench and surpassing much larger models in this area. Additionally, we are
releasing H2OVL-Mississippi-2B, a 2 billion parameter model for general use
cases, exhibiting highly competitive metrics across various academic
benchmarks. Both models build upon our prior work with H2O-Danube language
models, extending their capabilities into the visual domain. We release them
under the Apache 2.0 license, making VLMs accessible to everyone, democratizing
document AI and visual LLMs.
☆ MeNTi: Bridging Medical Calculator and LLM Agent with Nested Tool Calling
Integrating tools into Large Language Models (LLMs) has facilitated their
widespread application. Despite this, in specialized downstream task contexts,
reliance solely on tools is insufficient to fully address the complexities of
the real world. This particularly restricts the effective deployment of LLMs in
fields such as medicine. In this paper, we focus on the downstream tasks of
medical calculators, which use standardized tests to assess an individual's
health status. We introduce MeNTi, a universal agent architecture for LLMs.
MeNTi integrates a specialized medical toolkit and employs meta-tool and nested
calling mechanisms to enhance LLM tool utilization. Specifically, it achieves
flexible tool selection and nested tool calling to address practical issues
faced in intricate medical scenarios, including calculator selection, slot
filling, and unit conversion. To assess the capabilities of LLMs for
quantitative assessment throughout the clinical process of calculator
scenarios, we introduce CalcQA. This benchmark requires LLMs to use medical
calculators to perform calculations and assess patient health status. CalcQA is
constructed by professional physicians and includes 100 case-calculator pairs,
complemented by a toolkit of 281 medical tools. The experimental results
demonstrate significant performance improvements with our framework. This
research paves new directions for applying LLMs in demanding scenarios of
medicine.
☆ Large Language Models as Narrative-Driven Recommenders
Narrative-driven recommenders aim to provide personalized suggestions for
user requests expressed in free-form text such as "I want to watch a thriller
with a mind-bending story, like Shutter Island." Although large language models
(LLMs) have been shown to excel in processing general natural language queries,
their effectiveness for handling such recommendation requests remains
relatively unexplored. To close this gap, we compare the performance of 38
open- and closed-source LLMs of various sizes, such as Llama 3.2 and GPT-4o, in
a movie recommendation setting. For this, we utilize a gold-standard,
crowdworker-annotated dataset of posts from Reddit's movie suggestion community
and employ various prompting strategies, including zero-shot, identity, and
few-shot prompting. Our findings demonstrate the ability of LLMs to generate
contextually relevant movie recommendations, significantly outperforming other
state-of-the-art approaches, such as doc2vec. While we find that closed-source
and large-parameterized models generally perform best, medium-sized open-source
models remain competitive, being only slightly outperformed by their more
computationally expensive counterparts. Furthermore, we observe no significant
differences across prompting strategies for most models, underscoring the
effectiveness of simple approaches such as zero-shot prompting for
narrative-driven recommendations. Overall, this work offers valuable insights
for recommender system researchers as well as practitioners aiming to integrate
LLMs into real-world recommendation tools.
comment: Under review; 19 pages
☆ Enhancing Fact Retrieval in PLMs through Truthfulness
Pre-trained Language Models (PLMs) encode various facts about the world at
their pre-training phase as they are trained to predict the next or missing
word in a sentence. There has been interest in quantifying and improving
the amount of facts that can be extracted from PLMs, as they have been
envisioned to act as soft knowledge bases, which can be queried in natural
language. Different approaches exist to enhance fact retrieval from PLMs. Recent
work shows that the hidden states of PLMs can be leveraged to determine the
truthfulness of the PLMs' inputs. Leveraging this finding to improve factual
knowledge retrieval remains unexplored. In this work, we investigate the use of
a helper model to improve fact retrieval. The helper model assesses the
truthfulness of an input based on the corresponding hidden states
representations from the PLMs. We evaluate this approach on several masked PLMs
and show that it enhances fact retrieval by up to 33%. Our findings highlight
the potential of hidden states representations from PLMs in improving their
factual knowledge retrieval.
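The helper-model idea reduces to training a small classifier on hidden-state vectors and using its score to pick among candidate completions. The sketch below uses synthetic features and scikit-learn's logistic regression as stand-ins for real PLM hidden states and the paper's helper model:

```python
# Illustrative truthfulness probe over hidden states (synthetic features
# stand in for real PLM representations; not the paper's exact setup).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64                                    # hidden-state dimensionality
h_true = rng.normal(0.3, 1.0, size=(200, d))    # "truthful" inputs
h_false = rng.normal(-0.3, 1.0, size=(200, d))  # "untruthful" inputs
X = np.vstack([h_true, h_false])
y = np.array([1] * 200 + [0] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)

def rerank_by_truthfulness(candidates: list[np.ndarray]) -> int:
    """Pick the candidate whose hidden state scores most truthful."""
    scores = [probe.predict_proba(h.reshape(1, -1))[0, 1] for h in candidates]
    return int(np.argmax(scores))

print(rerank_by_truthfulness([rng.normal(-0.3, 1, d), rng.normal(0.3, 1, d)]))
```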
☆ Integrating Temporal Representations for Dynamic Memory Retrieval and Management in Large Language Models
Conventional dialogue agents often struggle with effective memory recall,
leading to redundant retrieval and inadequate management of unique user
associations. To address this, we propose SynapticRAG, a novel approach
integrating synaptic dynamics into Retrieval-Augmented Generation (RAG).
SynapticRAG integrates temporal representations into memory vectors, mimicking
biological synapses by differentiating events based on occurrence times and
dynamically updating memory significance. This model employs temporal scoring
for memory connections and a synaptic-inspired propagation control mechanism.
Experiments across English, Japanese, and Chinese datasets demonstrate
SynapticRAG's superiority over existing methods, including traditional RAG,
with up to 14.66% improvement in memory retrieval accuracy. Our approach
advances context-aware dialogue AI systems by enhancing long-term context
maintenance and specific information extraction from conversations.
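A temporally weighted retrieval score can be sketched as below. The exponential decay form, the mixing weights, and the "strength" term are illustrative assumptions, not SynapticRAG's actual scoring or propagation equations:

```python
# A minimal sketch of temporally weighted memory retrieval (the decay form
# and parameters are illustrative assumptions, not the paper's equations).
import math

def score(query_sim: float, t_now: float, t_event: float,
          strength: float, tau: float = 24.0) -> float:
    """Combine semantic similarity with a synapse-like temporal factor:
    recent or repeatedly reinforced memories score higher."""
    recency = math.exp(-(t_now - t_event) / tau)   # hours-scale decay
    return query_sim * (0.5 + 0.5 * recency) * strength

memories = [  # (name, similarity to query, event time in hours, strength)
    ("old strong memory", 0.9, 10.0, 1.5),
    ("fresh weak memory", 0.8, 95.0, 1.0),
]
t_now = 100.0
ranked = sorted(memories, key=lambda m: -score(m[1], t_now, m[2], m[3]))
for name, *_ in ranked:
    print(name)
```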
☆ Bias in the Mirror: Are LLMs' opinions robust to their own adversarial attacks?
Large language models (LLMs) inherit biases from their training data and
alignment processes, influencing their responses in subtle ways. While many
studies have examined these biases, little work has explored their robustness
during interactions. In this paper, we introduce a novel approach where two
instances of an LLM engage in self-debate, arguing opposing viewpoints to
persuade a neutral version of the model. Through this, we evaluate how firmly
biases hold and whether models are susceptible to reinforcing misinformation or
shifting to harmful viewpoints. Our experiments span multiple LLMs of varying
sizes, origins, and languages, providing deeper insights into bias persistence
and flexibility across linguistic and cultural contexts.
☆ GeoCoder: Solving Geometry Problems by Generating Modular Code through Vision-Language Models
Geometry problem-solving demands advanced reasoning abilities to process
multimodal inputs and employ mathematical knowledge effectively.
Vision-language models (VLMs) have made significant progress in various
multimodal tasks. Yet, they still struggle with geometry problems and are
significantly limited by their inability to perform mathematical operations not
seen during pre-training, such as calculating the cosine of an arbitrary angle,
and by difficulties in correctly applying relevant geometry formulas. To
overcome these challenges, we present GeoCoder, which leverages modular
code-finetuning to generate and execute code using a predefined geometry
function library. By executing the code, we achieve accurate and deterministic
calculations, in contrast to the stochastic nature of autoregressive token
prediction, while the function library minimizes errors in formula usage. We
also propose a multimodal retrieval-augmented variant of GeoCoder, named
RAG-GeoCoder, which incorporates a non-parametric memory module for retrieving
functions from the geometry library, thereby reducing reliance on parametric
memory. Our modular code-finetuning approach enhances the geometric reasoning
capabilities of VLMs, yielding an average improvement of over 16% across
various question complexities on the GeomVerse dataset compared to other
finetuning methods.
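The core mechanism of executing model-generated code against a fixed function library can be illustrated in a few lines. The library contents and the "generated" snippet below are toy examples, not GeoCoder's actual function set:

```python
# Sketch of executing model-generated code against a predefined geometry
# library (the library and the snippet are toy stand-ins, not GeoCoder's).
import math

GEOMETRY_LIB = {
    "cos_deg": lambda a: math.cos(math.radians(a)),
    "law_of_cosines": lambda b, c, angle_deg:
        math.sqrt(b**2 + c**2 - 2*b*c*math.cos(math.radians(angle_deg))),
}

generated_code = "result = law_of_cosines(3, 4, 60)"  # imagine a VLM wrote this

namespace = dict(GEOMETRY_LIB)      # only library functions are visible
exec(generated_code, namespace)     # deterministic numeric execution
print(round(namespace["result"], 4))  # 3.6056: the side opposite the 60 degree angle
```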
★ RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards
Xinze Li, Sen Mei, Zhenghao Liu, Yukun Yan, Shuo Wang, Shi Yu, Zheni Zeng, Hao Chen, Ge Yu, Zhiyuan Liu, Maosong Sun, Chenyan Xiong
Retrieval-Augmented Generation (RAG) has proven its effectiveness in
mitigating hallucinations in Large Language Models (LLMs) by retrieving
knowledge from external resources. To adapt LLMs for RAG pipelines, current
approaches use instruction tuning to optimize LLMs, improving their ability to
utilize retrieved knowledge. This supervised fine-tuning (SFT) approach focuses
on equipping LLMs to handle diverse RAG tasks using different instructions.
However, it trains RAG modules to overfit training signals and overlooks the
varying data preferences among agents within the RAG system. In this paper, we
propose a Differentiable Data Rewards (DDR) method, which end-to-end trains RAG
systems by aligning data preferences between different RAG modules. DDR works
by collecting the rewards to optimize each agent with a rollout method. This
method prompts agents to sample some potential responses as perturbations,
evaluates the impact of these perturbations on the whole RAG system, and
subsequently optimizes the agent to produce outputs that improve the
performance of the RAG system. Our experiments on various knowledge-intensive
tasks demonstrate that DDR significantly outperforms the SFT method,
particularly for LLMs with smaller-scale parameters that depend more on the
retrieved knowledge. Additionally, DDR exhibits a stronger capability to align
the data preference between RAG modules. The DDR method makes the generation
module more effective in extracting key information from documents and mitigating
conflicts between parametric memory and external knowledge. All codes are
available at https://github.com/OpenMatch/RAG-DDR.
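The rollout step can be pictured as sampling candidate outputs, scoring their effect on the end task, and forming a preference pair to optimize the agent. The reward function and candidates below are toy simplifications; DDR's actual training uses end-to-end system feedback and a differentiable objective:

```python
# Simplified sketch of reward rollout for aligning a RAG module (reward
# and candidates are toy stand-ins, not the DDR implementation).
def system_reward(answer: str, gold: str) -> float:
    """Toy reward: token-level F1 of the final RAG answer against the gold."""
    a, g = set(answer.lower().split()), set(gold.lower().split())
    if not a or not g:
        return 0.0
    p, r = len(a & g) / len(a), len(a & g) / len(g)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def rollout_preference(candidates: list[str], gold: str):
    """Sample perturbations, score their effect on the whole system, and
    return a (chosen, rejected) pair to optimize the agent with."""
    scored = sorted(candidates, key=lambda c: system_reward(c, gold))
    return scored[-1], scored[0]

chosen, rejected = rollout_preference(
    ["Paris is the capital of France", "The capital is Berlin"],
    "Paris is the capital of France")
print("chosen:", chosen, "| rejected:", rejected)
```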
☆ MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs
Large language models (LLMs) can solve arithmetic word problems with high
accuracy, but little is known about how well they generalize to problems that
are more complex than the ones on which they have been trained. Empirical
investigations of such questions are impeded by two major flaws of current
evaluations: (i) much of the evaluation data is contaminated, in the sense that
it has already been seen during training, and (ii) benchmark datasets do not
capture how problem proofs may be arbitrarily complex in various ways. As a
step towards addressing these issues, we present a framework for evaluating
LLMs on problems that have arbitrarily complex arithmetic proofs, called
MathGAP. MathGAP generates problems that follow fixed proof specifications --
along with chain-of-thought reasoning annotations -- enabling systematic
studies on generalization with respect to arithmetic proof complexity. We apply
MathGAP to analyze how in-context learning interacts with generalization to
problems that have more complex proofs. We find that among the models tested,
most show a significant decrease in performance as proofs get deeper and wider.
This effect is more pronounced in complex, nonlinear proof structures, which
are challenging even for GPT-4o. Surprisingly, providing in-context examples
from the same distribution as the test set is not always beneficial for
performance. In particular, zero-shot prompting as well as demonstrating a
diverse range of examples that are less complex than the test data sometimes
yield similar or higher accuracies.
comment: Preprint
☆ Enhancing Text Generation in Joint NLG/NLU Learning Through Curriculum Learning, Semi-Supervised Training, and Advanced Optimization Techniques
Text generation is the automated process of producing written or spoken
language using computational methods. It involves generating coherent and
contextually relevant text based on predefined rules or learned patterns.
However, challenges in text generation arise from maintaining coherence,
ensuring diversity and creativity, and avoiding biases or inappropriate
content. This research paper developed a novel approach to improve text
generation in the context of joint Natural Language Generation (NLG) and
Natural Language Understanding (NLU) learning. The data is prepared by
gathering and preprocessing annotated datasets, including cleaning,
tokenization, stemming, and stop-word removal. Feature extraction techniques
such as POS tagging, Bag of words, and Term Frequency-Inverse Document
Frequency (TF-IDF) are applied. Transformer-based encoders and decoders are
employed, capturing long-range dependencies and improving source-target sequence
modelling. Pre-trained language models like Optimized BERT are incorporated,
along with a Hybrid Redfox Artificial Hummingbird Algorithm (HRAHA).
Reinforcement learning with policy gradient techniques, semi-supervised
training, improved attention mechanisms, and differentiable approximations like
the straight-through Gumbel-Softmax estimator are employed to fine-tune the models
and handle complex linguistic tasks effectively. The proposed model is
implemented using Python.
☆ Repetition Neurons: How Do Language Models Produce Repetitions?
This paper introduces repetition neurons, regarded as skill neurons
responsible for the repetition problem in text generation tasks. These neurons
are progressively activated more strongly as repetition continues, indicating
that they perceive repetition as a task to copy the previous context
repeatedly, similar to in-context learning. We identify these repetition
neurons by comparing activation values before and after the onset of repetition
in texts generated by recent pre-trained language models. We analyze the
repetition neurons in three English and one Japanese pre-trained language
model and observe similar patterns across them.
☆ Seeing Through VisualBERT: A Causal Adventure on Memetic Landscapes EMNLP
Detecting offensive memes is crucial, yet standard deep neural network
systems often remain opaque. Various input attribution-based methods attempt to
interpret their behavior, but they face challenges with implicitly offensive
memes and non-causal attributions. To address these issues, we propose a
framework based on a Structural Causal Model (SCM). In this framework,
VisualBERT is trained to predict the class of an input meme based on both meme
input and causal concepts, allowing for transparent interpretation. Our
qualitative evaluation demonstrates the framework's effectiveness in
understanding model behavior, particularly in determining whether the model was
right for the right reason, and in identifying reasons behind
misclassification. Additionally, quantitative analysis assesses the
significance of proposed modelling choices, such as de-confounding, adversarial
learning, and dynamic routing, and compares them with input attribution
methods. Surprisingly, we find that input attribution methods do not guarantee
causality within our framework, raising questions about their reliability in
safety-critical applications. The project page is at:
https://newcodevelop.github.io/causality_adventure/
comment: Accepted at EMNLP Findings 2024
☆ IterSelectTune: An Iterative Training Framework for Efficient Instruction-Tuning Data Selection
As large language models (LLMs) continue to advance, instruction tuning has
become critical for improving their ability to generate accurate and
contextually appropriate responses. Although numerous instruction-tuning
datasets have been developed to enhance LLM performance, selecting high-quality
instruction data from large source datasets typically demands significant human
effort. In this work, we introduce IterSelectTune, an efficient,
cost-effective iterative training policy for selecting high-quality instruction
data with no human involvement and limited reliance on GPT-4. By fine-tuning on
approximately 20% of the source data, our method consistently outperforms
models fine-tuned on the full dataset across multiple benchmarks and public
test datasets. These results highlight the effectiveness of our approach in
enhancing LLM performance while reducing the computational resources required
for instruction tuning.
☆ Progressive Mixed-Precision Decoding for Efficient LLM Inference
In spite of the great potential of large language models (LLMs) across
various tasks, their deployment on resource-constrained devices remains
challenging due to their excessive computational and memory demands.
Quantization has emerged as an effective solution by storing weights in reduced
precision. However, utilizing low precision (i.e., 2/3-bit) to substantially
alleviate the memory-boundedness of LLM decoding still suffers from a
prohibitive performance drop. In this work, we argue that existing approaches
fail to explore the diversity in computational patterns, redundancy, and
sensitivity to approximations of the different phases of LLM inference,
resorting to a uniform quantization policy throughout. Instead, we propose a
novel phase-aware method that selectively allocates precision during different
phases of LLM inference, achieving both strong context extraction during
prefill and efficient memory bandwidth utilization during decoding. To further
address the memory-boundedness of the decoding phase, we introduce Progressive
Mixed-Precision Decoding (PMPD), a technique that enables the gradual lowering
of precision deeper in the generated sequence, together with a spectrum of
precision-switching schedulers that dynamically drive the precision-lowering
decisions in either a task-adaptive or a prompt-adaptive manner. Extensive
evaluation across diverse language tasks shows that when targeting Nvidia GPUs,
PMPD achieves a 1.4-12.2x speedup in matrix-vector multiplications over
fp16 models, while when targeting an LLM-optimized NPU, our approach delivers a
throughput gain of 3.8-8.0x over fp16 models and up to 1.54x
over uniform quantization approaches while preserving the output quality.
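The idea of a precision-lowering schedule over decoding steps is easy to sketch. The step thresholds and bit-widths below are made-up illustrations; PMPD's schedulers choose them in a task- or prompt-adaptive way:

```python
# Sketch of a precision-lowering schedule for decoding (thresholds and
# bit-widths are hypothetical; PMPD's schedulers are adaptive).
def precision_for_step(step: int, schedule=((0, 8), (64, 4), (256, 3))) -> int:
    """Return the weight bit-width to use at a given decoding step:
    precision is lowered progressively deeper into the generated sequence."""
    bits = schedule[0][1]
    for start, b in schedule:
        if step >= start:
            bits = b
    return bits

for step in (0, 63, 64, 255, 256, 1000):
    print(f"step {step:4d} -> {precision_for_step(step)}-bit weights")
```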
☆ Breaking the Manual Annotation Bottleneck: Creating a Comprehensive Legal Case Criticality Dataset through Semi-Automated Labeling
Predicting case criticality helps legal professionals in the court system
manage large volumes of case law. This paper introduces the Criticality
Prediction dataset, a new resource for evaluating the potential influence of
Swiss Federal Supreme Court decisions on future jurisprudence. Unlike existing
approaches that rely on resource-intensive manual annotations, we
semi-automatically derive labels leading to a much larger dataset than
otherwise possible. Our dataset features a two-tier labeling system: (1) the
LD-Label, which identifies cases published as Leading Decisions (LD), and (2)
the Citation-Label, which ranks cases by their citation frequency and recency.
This allows for a more nuanced evaluation of case importance. We evaluate
several multilingual models, including fine-tuned variants and large language
models, and find that fine-tuned models consistently outperform zero-shot
baselines, demonstrating the need for task-specific adaptation. Our
contributions include the introduction of this task and the release of a
multilingual dataset to the research community.
☆ MedINST: Meta Dataset of Biomedical Instructions
The integration of large language model (LLM) techniques in the field of
medical analysis has brought about significant advancements, yet the scarcity
of large, diverse, and well-annotated datasets remains a major challenge.
Medical data and tasks, which vary in format, size, and other parameters,
require extensive preprocessing and standardization for effective use in
training LLMs. To address these challenges, we introduce MedINST, the Meta
Dataset of Biomedical Instructions, a novel multi-domain, multi-task
instructional meta-dataset. MedINST comprises 133 biomedical NLP tasks and over
7 million training samples, making it the most comprehensive biomedical
instruction dataset to date. Using MedINST as the meta dataset, we curate
MedINST32, a challenging benchmark with different task difficulties aiming to
evaluate LLMs' generalization ability. We fine-tune several LLMs on MedINST and
evaluate them on MedINST32, showcasing enhanced cross-task generalization.
☆ Unlocking Legal Knowledge: A Multilingual Dataset for Judicial Summarization in Switzerland
Legal research is a time-consuming task that most lawyers face on a daily
basis. A large part of legal research entails looking up relevant caselaw and
bringing it in relation to the case at hand. Lawyers heavily rely on summaries
(also called headnotes) to find the right cases quickly. However, not all
decisions are annotated with headnotes and writing them is time-consuming.
Automated headnote creation has the potential to make hundreds of thousands of
decisions more accessible for legal research in Switzerland alone. To kickstart
this, we introduce the Swiss Leading Decision Summarization (SLDS) dataset, a
novel cross-lingual resource featuring 18K court rulings from the Swiss Federal
Supreme Court (SFSC), in German, French, and Italian, along with German
headnotes. We fine-tune and evaluate three mT5 variants, along with proprietary
models. Our analysis highlights that while proprietary models perform well in
zero-shot and one-shot settings, fine-tuned smaller models still provide a
strong competitive edge. We publicly release the dataset to facilitate further
research in multilingual legal summarization and the development of assistive
technologies for legal professionals.
☆ Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR
Automatic speech recognition (ASR) for low-resource languages remains a
challenge due to the scarcity of labeled training data. Parameter-efficient
fine-tuning and text-only adaptation are two popular methods that have been
used to address such low-resource settings. In this work, we investigate how
these techniques can be effectively combined using a multilingual multimodal
model like SeamlessM4T. Multimodal models are able to leverage unlabeled text
via text-only adaptation with further parameter-efficient ASR fine-tuning, thus
boosting ASR performance. We also show cross-lingual transfer from a
high-resource language, achieving up to a relative 17% WER reduction over a
baseline in a zero-shot setting without any labeled speech.
☆ NLIP_Lab-IITH Multilingual MT System for WAT24 MT Shared Task
This paper describes NLIP Lab's multilingual machine translation system for
the WAT24 shared task on multilingual Indic MT task for 22 scheduled languages
belonging to 4 language families. We explore pre-training for Indic languages
using alignment agreement objectives. We utilize bi-lingual dictionaries to
substitute words from source sentences. Furthermore, we fine-tuned language
direction-specific multilingual translation models using small and high-quality
seed data. Our primary submission is a 243M parameters multilingual translation
model covering 22 Indic languages. On the IN22-Gen benchmark, we achieved an
average chrF++ score of 46.80 and a BLEU score of 18.19 in the En-Indic
direction; in the Indic-En direction, we achieved an average chrF++ score of
56.34 and a BLEU score of 30.82. On the IN22-Conv benchmark, we achieved an
average chrF++ score of 43.43 and a BLEU score of 16.58 in the En-Indic
direction, and in the Indic-En direction, averages of 52.44 chrF++ and 29.77
BLEU. Our model is competitive with IndicTransv1 (a 474M parameter model). Our
code and models are available at
https://github.com/maharajbrahma/WAT2024-MultiIndicMT.
comment: WMT 24 WAT Shared Task IndicMultiMT (Best System)
☆ Similarity-Dissimilarity Loss with Supervised Contrastive Learning for Multi-label Classification
Supervised contrastive learning has been explored in making use of label
information for multi-label classification, but determining positive samples in
the multi-label scenario remains challenging. Previous studies have examined
strategies for identifying positive samples, considering label overlap
proportion between anchors and samples. However, they ignore various relations
between given anchors and samples, as well as how to dynamically adjust the
weights in contrastive loss functions based on different relations, leading to
great ambiguity. In this paper, we introduce five distinct relations between
multi-label samples and propose a Similarity-Dissimilarity Loss with
contrastive learning for multi-label classification. Our loss function
re-weights the loss by computing the similarity and dissimilarity between
positive samples and a given anchor based on the introduced relations. We
mainly conduct experiments for multi-label text classification on MIMIC
datasets, then further extend the evaluation to MS-COCO. The experimental
results show that our proposed loss effectively improves the performance of all
encoders under the supervised contrastive learning paradigm, demonstrating its
effectiveness and robustness.
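A re-weighted supervised contrastive loss of this flavor can be sketched as below. The weighting here uses simple Jaccard overlap between label sets as an assumed proxy; the paper's five-relation scheme is more fine-grained:

```python
# Illustrative re-weighted supervised contrastive loss for multi-label data
# (Jaccard label overlap is an assumed stand-in for the paper's relations).
import torch
import torch.nn.functional as F

def multilabel_supcon(z: torch.Tensor, labels: torch.Tensor, tau: float = 0.1):
    """z: (N, d) embeddings; labels: (N, L) multi-hot label matrix."""
    z = F.normalize(z, dim=1)
    sim = z @ z.T / tau                                   # pairwise logits
    inter = labels @ labels.T                             # shared labels
    union = labels.sum(1, keepdim=True) + labels.sum(1) - inter
    weight = inter / union.clamp(min=1)                   # label similarity in [0, 1]
    weight.fill_diagonal_(0)                              # exclude self-pairs
    self_mask = torch.eye(len(z), dtype=torch.bool)
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(self_mask, -1e9), dim=1, keepdim=True)
    denom = weight.sum(1).clamp(min=1e-8)
    return -(weight * log_prob).sum(1).div(denom).mean()

z = torch.randn(8, 32)
labels = (torch.rand(8, 5) > 0.6).float()
print(multilabel_supcon(z, labels).item())
```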
★ Think Thrice Before You Act: Progressive Thought Refinement in Large Language Models
Chengyu Du, Jinyi Han, Yizhou Ying, Aili Chen, Qianyu He, Haokun Zhao, Sirui Xia, Haoran Guo, Jiaqing Liang, Zulong Chen, Liangyue Li, Yanghua Xiao
Recent advancements in large language models (LLMs) have demonstrated that
progressive refinement, rather than providing a single answer, results in more
accurate and thoughtful outputs. However, existing methods often rely heavily
on supervision signals to evaluate previous responses, making it difficult to
assess output quality in more open-ended scenarios effectively. Additionally,
these methods are typically designed for specific tasks, which limits their
generalization to new domains. To address these limitations, we propose
Progressive Thought Refinement (PTR), a framework that enables LLMs to refine
their responses progressively. PTR operates in two phases: (1) Thought Data
Construction Phase: We propose a weak-and-strong model collaborative selection
strategy to build a high-quality progressive refinement dataset that ensures
logical consistency from thoughts to answers, with answers gradually
refined in each round. (2) Thought-Mask Fine-Tuning Phase: We design a training
structure to mask the "thought" and adjust loss weights to encourage LLMs to
refine prior thought, teaching them to implicitly understand "how to improve"
rather than "what is correct." Experimental results show that PTR significantly
enhances LLM performance across ten diverse tasks (avg. from 49.6% to 53.5%)
without task-specific fine-tuning. Notably, in more open-ended tasks, LLMs also
demonstrate substantial improvements in the quality of responses beyond mere
accuracy, suggesting that PTR truly teaches LLMs to self-improve over time.
comment: 10 pages, 4 figures
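The thought-mask fine-tuning described above amounts to down-weighting the loss on "thought" tokens so answer tokens dominate the gradient. The weights and token layout in this sketch are illustrative assumptions, not PTR's exact scheme:

```python
# Schematic of thought-masked loss weighting (the 0.1 weight and token
# layout are illustrative assumptions, not the paper's exact recipe).
import torch
import torch.nn.functional as F

vocab, T = 100, 10
logits = torch.randn(T, vocab)                 # model outputs for one sequence
targets = torch.randint(0, vocab, (T,))
is_thought = torch.tensor([1, 1, 1, 1, 1, 1, 0, 0, 0, 0]).bool()  # first 6 = "thought"

per_token = F.cross_entropy(logits, targets, reduction="none")
weights = torch.where(is_thought, torch.tensor(0.1), torch.tensor(1.0))
loss = (weights * per_token).sum() / weights.sum()   # answer tokens dominate
print(loss.item())
```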
☆ Attr-Int: A Simple and Effective Entity Alignment Framework for Heterogeneous Knowledge Graphs
Entity alignment (EA) refers to the task of linking entities in different
knowledge graphs (KGs). Existing EA methods rely heavily on structural
isomorphism. However, in real-world KGs, aligned entities usually have
non-isomorphic neighborhood structures, which paralyses the application of
these structure-dependent methods. In this paper, we investigate and tackle the
problem of entity alignment between heterogeneous KGs. First, we propose two
new benchmarks to closely simulate real-world EA scenarios of heterogeneity.
Then we conduct extensive experiments to evaluate the performance of
representative EA methods on the new benchmarks. Finally, we propose a simple
and effective entity alignment framework called Attr-Int, in which innovative
attribute information interaction methods can be seamlessly integrated with any
embedding encoder for entity alignment, improving the performance of existing
entity alignment techniques. Experiments demonstrate that our framework
outperforms the state-of-the-art approaches on two new benchmarks.
★ MoR: Mixture of Ranks for Low-Rank Adaptation Tuning
Low-Rank Adaptation (LoRA) drives research to align its performance with full
fine-tuning. However, significant challenges remain: (1) Simply increasing the
rank size of LoRA does not effectively capture high-rank information, which
leads to a performance bottleneck. (2) MoE-style LoRA methods substantially
increase parameters and inference latency, contradicting the goals of efficient
fine-tuning and ease of application. To address these challenges, we introduce
Mixture of Ranks (MoR), which learns rank-specific information for different
tasks based on input and efficiently integrates multi-rank information. We
first propose a new framework that equates the integration of multiple LoRAs
to expanding the rank of LoRA. Moreover, we hypothesize that low-rank LoRA
already captures sufficient intrinsic information, and MoR can derive high-rank
information through mathematical transformations of the low-rank components.
Thus, MoR reduces the learning difficulty of LoRA and enhances its
multi-task capabilities. MoR achieves impressive results, delivering a
1.31% performance improvement while using only 93.93% of the parameters
compared to baseline methods.
comment: 11 pages, 7 figures
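One plausible reading of an input-gated mixture over LoRA rank components is sketched below. This is an assumed interpretation for illustration, not the authors' released code; the gate design and initialization are invented:

```python
# Sketch of an input-gated mixture over LoRA rank components (an assumed
# reading of "mixture of ranks", not the authors' implementation).
import torch
import torch.nn as nn

class MixtureOfRanksLinear(nn.Module):
    def __init__(self, d_in, d_out, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)  # frozen pretrained weight
        self.base.weight.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.gate = nn.Linear(d_in, rank)               # input-dependent rank gates

    def forward(self, x):
        g = torch.softmax(self.gate(x), dim=-1)         # (..., rank)
        low_rank = (x @ self.A.T) * g                   # gate each rank channel
        return self.base(x) + low_rank @ self.B.T

layer = MixtureOfRanksLinear(64, 64)
print(layer(torch.randn(2, 64)).shape)                  # torch.Size([2, 64])
```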
☆ Towards Hybrid Intelligence in Journalism: Findings and Lessons Learnt from a Collaborative Analysis of Greek Political Rhetoric by ChatGPT and Humans
Thanasis Troboukis, Kelly Kiki, Antonis Galanopoulos, Pavlos Sermpezis, Stelios Karamanidis, Ilias Dimitriadis, Athena Vakali
This chapter introduces a research project titled "Analyzing the Political
Discourse: A Collaboration Between Humans and Artificial Intelligence", which
was initiated in preparation for Greece's 2023 general elections. The project
focused on the analysis of political leaders' campaign speeches, employing
Artificial Intelligence (AI), in conjunction with an interdisciplinary team
comprising journalists, a political scientist, and data scientists. The chapter
delves into various aspects of political discourse analysis, including
sentiment analysis, polarization, populism, topic detection, and Named Entities
Recognition (NER). This experimental study investigates the capabilities of
large language models (LLMs), and in particular OpenAI's ChatGPT, for analyzing
political speech, evaluates its strengths and weaknesses, and highlights the
essential role of human oversight in using AI in journalism projects and
potentially other societal sectors. The project stands as an innovative example
of human-AI collaboration (known also as "hybrid intelligence") within the
realm of digital humanities, offering valuable insights for future initiatives.
☆ Linguistically Grounded Analysis of Language Models using Shapley Head Values
Understanding how linguistic knowledge is encoded in language models is
crucial for improving their generalisation capabilities. In this paper, we
investigate the processing of morphosyntactic phenomena, by leveraging a
recently proposed method for probing language models via Shapley Head Values
(SHVs). Using the English language BLiMP dataset, we test our approach on two
widely used models, BERT and RoBERTa, and compare how linguistic constructions
such as anaphor agreement and filler-gap dependencies are handled. Through
quantitative pruning and qualitative clustering analysis, we demonstrate that
attention heads responsible for processing related linguistic phenomena cluster
together. Our results show that SHV-based attributions reveal distinct patterns
across both models, providing insights into how language models organize and
process linguistic information. These findings support the hypothesis that
language models learn subnetworks corresponding to linguistic theory, with
potential implications for cross-linguistic model analysis and interpretability
in Natural Language Processing (NLP).
☆ Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs
Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Dilip Venkatesh, Raj Dabre, Anoop Kunchukuttan, Mitesh M. Khapra
Evaluating machine-generated text remains a significant challenge in NLP,
especially for non-English languages. Current methodologies, including
automated metrics, human assessments, and LLM-based evaluations, predominantly
focus on English, revealing a significant gap in multilingual evaluation
frameworks. We introduce the Cross Lingual Auto Evaluation (CIA) Suite, an
extensible framework that includes evaluator LLMs (Hercule) and a novel test
set (Recon) specifically designed for multilingual evaluation. Our test set
features 500 human-annotated instructions spanning various task capabilities
along with human judgment scores across six languages. This would enable
benchmarking of general-purpose multilingual LLMs and facilitate
meta-evaluation of Evaluator LLMs. The proposed model, Hercule, is a
cross-lingual evaluation model that addresses the scarcity of reference answers
in the target language by learning to assign scores to responses based on
easily available reference answers in English. Our experiments demonstrate that
Hercule aligns more closely with human judgments compared to proprietary
models, demonstrating the effectiveness of such cross-lingual evaluation in
low-resource scenarios. Further, it is also effective in zero-shot evaluation on
unseen languages. This study is the first comprehensive examination of
cross-lingual evaluation using LLMs, presenting a scalable and effective
approach for multilingual assessment. All code, datasets, and models will be
publicly available to enable further research in this important area.
☆ Metacognitive Monitoring: A Human Ability Beyond Generative Artificial Intelligence
Large language models (LLMs) have shown impressive alignment with human
cognitive processes, raising questions about the extent of their similarity to
human cognition. This study investigates whether LLMs, specifically ChatGPT,
possess metacognitive monitoring abilities akin to humans, particularly in
predicting memory performance on an item-by-item basis. We employed a
cross-agent prediction model to compare the metacognitive performance of humans
and ChatGPT in a language-based memory task involving garden-path sentences
preceded by either fitting or unfitting context sentences. Both humans and
ChatGPT rated the memorability of these sentences; humans then completed a
surprise recognition memory test. Our findings reveal a significant positive
relationship between humans' memorability ratings and their actual recognition
performance, indicating reliable metacognitive monitoring. In contrast, ChatGPT
did not exhibit a similar predictive capability. Bootstrapping analyses
demonstrated that none of the GPT models tested (GPT-3.5-turbo, GPT-4-turbo,
GPT-4o) could accurately predict human memory performance on a per-item basis.
This suggests that, despite their advanced language processing abilities and
alignment with human cognition at the object level, current LLMs lack the
metacognitive mechanisms that enable humans to anticipate their memory
performance. These results highlight a fundamental difference between human and
AI cognition at the metacognitive level. Addressing this gap is crucial for
developing AI systems capable of effective self-monitoring and adaptation to
human needs, thereby enhancing human-AI interactions across domains such as
education and personalized learning.
comment: 28 pages, 2 figures. arXiv admin note: substantial text overlap with
arXiv:2403.05152
☆ On the Use of Audio to Improve Dialogue Policies
With the significant progress of speech technologies, spoken goal-oriented
dialogue systems are becoming increasingly popular. One of the main modules of
a dialogue system is typically the dialogue policy, which is responsible for
determining system actions. This component usually relies only on audio
transcriptions, being strongly dependent on their quality and ignoring very
important extralinguistic information embedded in the user's speech. In this
paper, we propose new architectures to add audio information by combining
speech and text embeddings using a Double Multi-Head Attention component. Our
experiments show that audio embedding-aware dialogue policies outperform
text-based ones, particularly in noisy transcription scenarios, and that how
text and audio embeddings are combined is crucial to improve performance. We
obtained a 9.8% relative improvement in the User Request Score compared to an
only-text-based dialogue system on the DSTC2 dataset.
comment: IberSpeech 2024
☆ Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant
The development of large language models (LLMs) has significantly enhanced
the capabilities of multimodal LLMs (MLLMs) as general assistants. However,
the lack of user-specific knowledge still restricts their application in
daily life. In this paper, we introduce the Retrieval Augmented Personalization
(RAP) framework for MLLMs' personalization. Starting from a general MLLM, we
turn it into a personalized assistant in three steps. (a) Remember: We design a
key-value database to store user-related information, e.g., user's name, avatar
and other attributes. (b) Retrieve: When the user initiates a conversation, RAP
will retrieve relevant information from the database using a multimodal
retriever. (c) Generate: The input query and retrieved concepts' information
are fed into MLLMs to generate personalized, knowledge-augmented responses.
Unlike previous methods, RAP allows real-time concept editing via updating the
external database. To further improve generation quality and alignment with
user-specific information, we design a pipeline for data collection and create
a specialized dataset for personalized training of MLLMs. Based on the dataset,
we train a series of MLLMs as personalized multimodal assistants. By
pretraining on a large-scale dataset, RAP-MLLMs can generalize to infinite visual
concepts without additional finetuning. Our models demonstrate outstanding
flexibility and generation quality across a variety of tasks, such as
personalized image captioning, question answering and visual recognition. The
code, data and models are available at https://github.com/Hoar012/RAP-MLLM.
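The remember/retrieve/generate loop maps onto a small key-value store queried by embedding similarity. The bag-of-words embedding and the single stored concept below are toy stand-ins; RAP uses a multimodal retriever over user data:

```python
# Minimal remember/retrieve/generate loop over a user concept database
# (toy bag-of-words embeddings stand in for RAP's multimodal retriever).
import re
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    vec = np.zeros(dim)
    for tok in re.findall(r"\w+", text.lower()):
        vec[hash(tok) % dim] += 1.0
    n = np.linalg.norm(vec)
    return vec / n if n else vec

database = {}                                       # (a) Remember
database["user's dog"] = ("Rex, a golden retriever",
                          embed("dog golden retriever Rex"))

def retrieve(query: str, k: int = 1):               # (b) Retrieve
    q = embed(query)
    scored = sorted(database.items(), key=lambda kv: -float(q @ kv[1][1]))
    return [info for _, (info, _) in scored[:k]]

query = "What breed is my dog?"
context = "; ".join(retrieve(query))                # (c) Generate (prompt only)
print(f"[user info: {context}]\nQuestion: {query}")
```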
☆ LAR-ECHR: A New Legal Argument Reasoning Task and Dataset for Cases of the European Court of Human Rights
We present Legal Argument Reasoning (LAR), a novel task designed to evaluate
the legal reasoning capabilities of Large Language Models (LLMs). The task
requires selecting the correct next statement (from multiple choice options) in
a chain of legal arguments from court proceedings, given the facts of the case.
We constructed a dataset (LAR-ECHR) for this task using cases from the European
Court of Human Rights (ECHR). We evaluated seven general-purpose LLMs on
LAR-ECHR and found that (a) the ranking of the models is aligned with that of
LegalBench, an established US-based legal reasoning benchmark, even though
LAR-ECHR is based on EU law, (b) LAR-ECHR distinguishes top models more
clearly, compared to LegalBench, (c) even the best model (GPT-4o) obtains 75.8%
accuracy on LAR-ECHR, indicating significant potential for further model
improvement. The process followed to construct LAR-ECHR can be replicated with
cases from other legal systems.
comment: Published in Natural Legal Language Processing (NLLP) 2024 workshop
☆ Representation Learning of Structured Data for Medical Foundation Models NeurIPS 2024
Vijay Prakash Dwivedi, Viktor Schlegel, Andy T. Liu, Thanh-Tung Nguyen, Abhinav Ramesh Kashyap, Jeng Wei, Wei-Hsian Yin, Stefan Winkler, Robby T. Tan
Large Language Models (LLMs) have demonstrated remarkable performance across
various domains, including healthcare. However, their ability to effectively
represent structured non-textual data, such as the alphanumeric medical codes
used in records like ICD-10 or SNOMED-CT, is limited and has been particularly
exposed in recent research. This paper examines the challenges LLMs face in
processing medical codes due to the shortcomings of current tokenization
methods. As a result, we introduce the UniStruct architecture to design a
multimodal medical foundation model of unstructured text and structured data,
which addresses these challenges by adapting subword tokenization techniques
specifically for the structured medical codes. Our approach is validated
through model pre-training on both an extensive internal medical database and a
public repository of structured medical records. Trained on over 1 billion
tokens on the internal medical database, the proposed model achieves up to a
23% improvement in evaluation metrics, with around 2% gain attributed to our
proposed tokenization. Additionally, when evaluated on the EHRSHOT public
benchmark with a 1/1000 fraction of the pre-training data, the UniStruct model
improves performance on over 42% of the downstream tasks. Our approach not only
enhances the representation and generalization capabilities of patient-centric
models but also bridges a critical gap in representation learning models'
ability to handle complex structured medical data, alongside unstructured text.
comment: NeurIPS 2024 Workshop on Unifying Representations in Neural Models
(UniReps 2024)
★ Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement
Large language models (LLMs) often face a bottleneck in inference speed due
to their reliance on auto-regressive decoding. Recently, parallel decoding has
shown significant promise in enhancing inference efficiency. However, we have
identified two key issues with existing parallel decoding frameworks: (1)
decoding heads fail to balance prediction accuracy and the parallelism of
execution, and (2) parallel decoding is not a universal solution, as it can
bring unnecessary overheads at some challenging decoding steps. To address
these issues, we propose Cerberus, an adaptive parallel decoding framework
that introduces a gating mechanism to enable the LLM to adaptively choose
appropriate decoding approaches at each decoding step, along with a
new paradigm of decoding heads that incorporates sequential knowledge while
maintaining execution parallelism. The experimental results demonstrate that
Cerberus can achieve up to a 2.12x speedup compared to auto-regressive decoding,
and outperforms one of the leading parallel decoding frameworks, Medusa, with a
10% - 30% increase in acceleration and superior generation quality.
☆ Do LLMs Overcome Shortcut Learning? An Evaluation of Shortcut Challenges in Large Language Models
Large Language Models (LLMs) have shown remarkable capabilities in various
natural language processing tasks. However, LLMs may rely on dataset biases as
shortcuts for prediction, which can significantly impair their robustness and
generalization capabilities. This paper presents Shortcut Suite, a
comprehensive test suite designed to evaluate the impact of shortcuts on LLMs'
performance, incorporating six shortcut types, five evaluation metrics, and
four prompting strategies. Our extensive experiments yield several key
findings: 1) LLMs demonstrate varying reliance on shortcuts for downstream
tasks, significantly impairing their performance. 2) Larger LLMs are more
likely to utilize shortcuts under zero-shot and few-shot in-context learning
prompts. 3) Chain-of-thought prompting notably reduces shortcut reliance and
outperforms other prompting strategies, while few-shot prompts generally
underperform compared to zero-shot prompts. 4) LLMs often exhibit
overconfidence in their predictions, especially when dealing with datasets that
contain shortcuts. 5) LLMs generally have a lower explanation quality in
shortcut-laden datasets, with errors falling into three types: distraction,
disguised comprehension, and logical fallacy. Our findings offer new insights
for evaluating robustness and generalization in LLMs and suggest potential
directions for mitigating the reliance on shortcuts. The code is available at
https://github.com/yyhappier/ShortcutSuite.git.
☆ Probing-RAG: Self-Probing to Guide Language Models in Selective Document Retrieval
Retrieval-Augmented Generation (RAG) enhances language models by retrieving
and incorporating relevant external knowledge. However, traditional
retrieve-and-generate processes may not be optimized for real-world scenarios,
where queries might require multiple retrieval steps or none at all. In this
paper, we propose Probing-RAG, which utilizes the hidden state
representations from the intermediate layers of language models to adaptively
determine the necessity of additional retrievals for a given query. By
employing a pre-trained prober, Probing-RAG effectively captures the model's
internal cognition, enabling reliable decision-making about retrieving external
documents. Experimental results across five open-domain QA datasets demonstrate
that Probing-RAG outperforms previous methods while reducing the number of
redundant retrieval steps.
comment: 6 figures, 13 tables
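The decision step reduces to a small classifier over an intermediate-layer hidden state. The untrained toy network and threshold below are assumptions for illustration; the real prober is pre-trained on labeled trajectories:

```python
# Sketch of a prober reading an intermediate hidden state to decide whether
# another retrieval round is needed (toy untrained network; illustrative).
import torch
import torch.nn as nn

prober = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 2))

def needs_retrieval(hidden_state: torch.Tensor, threshold: float = 0.5) -> bool:
    """hidden_state: (768,) vector from an intermediate LM layer."""
    p_retrieve = torch.softmax(prober(hidden_state), dim=-1)[1].item()
    return p_retrieve > threshold

h = torch.randn(768)           # stand-in for a real intermediate-layer state
print("retrieve again" if needs_retrieval(h) else "answer directly")
```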
☆ Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems
Although large language models (LLMs) demonstrate impressive proficiency in
various tasks, they present potential safety risks, such as `jailbreaks', where
malicious inputs can coerce LLMs into generating harmful content. To address
these issues, many LLM developers have implemented various safety measures to
align these models. This alignment involves several techniques, including data
filtering during pre-training, supervised fine-tuning, reinforcement learning
from human feedback, and red-teaming exercises. These methods often introduce
deliberate and intentional biases similar to Political Correctness (PC) to
ensure the ethical behavior of LLMs. In this paper, we delve into the
intentional biases injected into LLMs for safety purposes and examine methods
to circumvent these safety alignment techniques. Notably, these intentional
biases result in a jailbreaking success rate in GPT-4o models that differs by
20% between non-binary and cisgender keywords and by 16% between white and
black keywords, even when the other parts of the prompts are identical. We
introduce the concept of PCJailbreak, highlighting the inherent risks posed by
these safety-induced biases. Additionally, we propose an efficient defense
method PCDefense, which prevents jailbreak attempts by injecting defense
prompts prior to generation. PCDefense stands as an appealing alternative to
Guard Models, such as Llama-Guard, that require additional inference cost after
text generation. Our findings emphasize the urgent need for LLM developers to
adopt a more responsible approach when designing and implementing safety
measures.
☆ Fine-Tuning Language Models on Multiple Datasets for Citation Intention Classification EMNLP 2024
Zeren Shui, Petros Karypis, Daniel S. Karls, Mingjian Wen, Saurav Manchanda, Ellad B. Tadmor, George Karypis
Citation intention Classification (CIC) tools classify citations by their
intention (e.g., background, motivation) and assist readers in evaluating the
contribution of scientific literature. Prior research has shown that pretrained
language models (PLMs) such as SciBERT can achieve state-of-the-art performance
on CIC benchmarks. PLMs are trained via self-supervision tasks on a large
corpus of general text and can quickly adapt to CIC tasks via moderate
fine-tuning on the corresponding dataset. Despite their advantages, PLMs can
easily overfit small datasets during fine-tuning. In this paper, we propose a
multi-task learning (MTL) framework that jointly fine-tunes PLMs on a dataset
of primary interest together with multiple auxiliary CIC datasets to take
advantage of additional supervision signals. We develop a data-driven task
relation learning (TRL) method that controls the contribution of auxiliary
datasets to avoid negative transfer and expensive hyper-parameter tuning. We
conduct experiments on three CIC datasets and show that fine-tuning with
additional datasets can improve the PLMs' generalization performance on the
primary dataset. PLMs fine-tuned with our proposed framework outperform the
current state-of-the-art models by 7% to 11% on small datasets while aligning
with the best-performing model on a large dataset.
comment: To be appear as a Findings paper at EMNLP 2024
☆ Mitigating Hallucinations in Large Vision-Language Models via Summary-Guided Decoding
Large Vision-Language Models (LVLMs) demonstrate impressive capabilities in
generating detailed and coherent responses from visual inputs. However, they
are prone to generate hallucinations due to an over-reliance on language
priors. To address this issue, we investigate the language priors in LVLMs and
make two key observations: (1) Even when predicting the tokens associated with
image-related part-of-speech (POS), models increasingly rely on linguistic
priors as the token sequences grow, thereby amplifying hallucinations. (2)
Methods that directly calibrate the LVLM's output distribution to mitigate language
priors can lead to a degradation in text quality or even exacerbate
hallucinations. Based on these findings, we propose a novel method,
Summary-Guided Decoding (SGD). This method naturally encourages the model to
focus more on image information by reducing the text context through summaries,
while controlling only the image-related POS tokens to maintain text quality.
Through experiments, we demonstrate that SGD achieves state-of-the-art
performance on object hallucination benchmarks. Furthermore, in terms of the
trade-off between precision and recall, SGD achieves Pareto optimality among
the existing methods. Lastly, we observe that although existing methods
struggle to balance the reduction of object hallucinations with maintaining
text quality, SGD demonstrates robustness in handling this challenge.
☆ Computational Approaches to Arabic-English Code-Switching
Natural Language Processing (NLP) is a vital computational method for
addressing language processing, analysis, and generation. NLP tasks form the
core of many daily applications, from automatic text correction to speech
recognition. While significant research has focused on NLP tasks for the
English language, less attention has been given to Modern Standard Arabic and
Dialectal Arabic. Globalization has also contributed to the rise of
Code-Switching (CS), where speakers mix languages within conversations and even
within individual words (intra-word CS). This is especially common in Arab
countries, where people often switch between dialects or between dialects and a
foreign language they master. CS between Arabic and English is frequent in
Egypt, especially on social media. Consequently, a significant amount of
code-switched content can be found online. Such code-switched data needs to be
investigated and analyzed for several NLP tasks to tackle the challenges of
this multilingual phenomenon and of the Arabic language. Several integral NLP
tasks have not previously been addressed for Arabic-English CS data. In this
work, we focus on the Named Entity Recognition (NER) task and other tasks that
help propose a solution for the NER task on CS data, e.g., Language
Identification. This work addresses this gap by proposing and applying
state-of-the-art techniques for Modern Standard Arabic and Arabic-English NER.
We have created the first annotated CS Arabic-English corpus for the NER task.
Also, we apply two enhancement techniques to improve the NER tagger on CS data
using CS contextual embeddings and data augmentation techniques. All methods
showed improvements in the performance of the NER taggers on CS data. Finally,
we propose several intra-word language identification approaches to determine
the language type of a mixed text and identify whether it is a named entity or
not.
comment: PhD thesis
☆ Mitigating Biases to Embrace Diversity: A Comprehensive Annotation Benchmark for Toxic Language EMNLP
This study introduces a prescriptive annotation benchmark grounded in
humanities research to ensure consistent, unbiased labeling of offensive
language, particularly for casual and non-mainstream language uses. We
contribute two newly annotated datasets that achieve higher inter-annotator
agreement between human and large language model (LLM) annotations compared to
original datasets based on descriptive instructions. Our experiments show that
LLMs can serve as effective alternatives when professional annotators are
unavailable. Moreover, smaller models fine-tuned on multi-source LLM-annotated
data outperform models trained on larger, single-source human-annotated
datasets. These findings highlight the value of structured guidelines in
reducing subjective variability, maintaining performance with limited data, and
embracing language diversity.
Content Warning: This article only analyzes offensive language for academic
purposes. Discretion is advised.
comment: 12 pages, 9 figures, EMNLP-NLP4DH 2024
☆ Reference-Based Post-OCR Processing with LLM for Diacritic Languages
Extracting fine-grained OCR text from aged documents in diacritic languages
remains challenging due to unexpected artifacts, time-induced degradation, and
lack of datasets. While standalone spell correction approaches have been
proposed, they show limited performance for historical documents due to
numerous possible OCR error combinations and differences between modern and
classical corpus distributions. We propose a method utilizing available
content-focused ebooks as a reference base to correct imperfect OCR-generated
text, supported by large language models. This technique generates
high-precision pseudo-page-to-page labels for diacritic languages, where small
strokes pose significant challenges in historical conditions. The pipeline
eliminates various types of noise from aged documents and addresses issues such
as missing characters, words, and disordered sequences. Our post-processing
method, which generated a large OCR dataset of classical Vietnamese books,
achieved a mean grading score of 8.72 on a 10-point scale. This outperformed
the state-of-the-art transformer-based Vietnamese spell correction model, which
scored 7.03 when evaluated on a sampled subset of the dataset. We also trained
a baseline OCR model to assess and compare it with well-known engines.
Experimental results demonstrate the strength of our baseline model compared to
widely used open-source solutions. The resulting dataset will be released
publicly to support future studies.
☆ Advancing Large Language Model Attribution through Self-Improving EMNLP 2024
Lei Huang, Xiaocheng Feng, Weitao Ma, Liang Zhao, Yuchun Fan, Weihong Zhong, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin
Teaching large language models (LLMs) to generate text with citations to
evidence sources can mitigate hallucinations and enhance verifiability in
information-seeking systems. However, improving this capability requires
high-quality attribution data, which is costly and labor-intensive. Inspired by
recent advances in self-improvement that enhance LLMs without manual
annotation, we present START, a Self-Taught AttRibuTion framework for
iteratively improving the attribution capability of LLMs. First, to prevent
models from stagnating due to initially insufficient supervision signals, START
leverages the model to self-construct synthetic training data for warming up.
To further self-improve the model's attribution ability, START iteratively
utilizes fine-grained preference supervision signals constructed from its
sampled responses to encourage robust, comprehensive, and attributable
generation. Experiments on three open-domain question-answering datasets,
covering long-form QA and multi-step reasoning, demonstrate significant
performance gains of 25.13% on average without relying on human annotations and
more advanced models. Further analysis reveals that START excels in aggregating
information across multiple sources.
comment: Accepted by EMNLP 2024 Main Conference
☆ Learning to Route with Confidence Tokens
Yu-Neng Chuang, Helen Zhou, Prathusha Kameswara Sarma, Parikshit Gopalan, John Boccio, Sara Bolouki, Xia Hu
Large language models (LLMs) have demonstrated impressive performance on
several tasks and are increasingly deployed in real-world applications.
However, especially in high-stakes settings, it becomes vital to know when the
output of an LLM may be unreliable. Depending on whether an answer is
trustworthy, a system can then choose to route the question to another expert,
or otherwise fall back on a safe default behavior. In this work, we study the
extent to which LLMs can reliably indicate confidence in their answers, and how
this notion of confidence can translate into downstream accuracy gains. We
propose Self-REF, a lightweight training strategy to teach LLMs to express
confidence in whether their answers are correct in a reliable manner. Self-REF
introduces confidence tokens into the LLM, from which a confidence score can be
extracted. Compared to conventional approaches such as verbalizing confidence
and examining token probabilities, we demonstrate empirically that confidence
tokens show significant improvements in downstream routing and rejection
learning tasks.
☆ BANTH: A Multi-label Hate Speech Detection Dataset for Transliterated Bangla
Fabiha Haider, Fariha Tanjim Shifat, Md Farhan Ishmam, Deeparghya Dutta Barua, Md Sakib Ul Rahman Sourove, Md Fahim, Md Farhad Alam
The proliferation of transliterated texts in digital spaces has emphasized
the need for detecting and classifying hate speech in languages beyond English,
particularly in low-resource languages. As online discourse can perpetuate
discrimination based on target groups, e.g. gender, religion, and origin,
multi-label classification of hateful content can help in comprehending hate
motivation and enhance content moderation. While previous efforts have focused
on monolingual or binary hate classification tasks, no work has yet addressed
the challenge of multi-label hate speech classification in transliterated
Bangla. We introduce BanTH, the first multi-label transliterated Bangla hate
speech dataset comprising 37.3k samples. The samples are sourced from YouTube
comments, where each instance is labeled with one or more target groups,
reflecting the regional demographic. We establish novel transformer
encoder-based baselines by further pre-training on transliterated Bangla
corpus. We also propose a novel translation-based LLM prompting strategy for
transliterated text. Experiments reveal that our further pre-trained encoders
achieve state-of-the-art performance on the BanTH dataset, while our
translation-based prompting outperforms other strategies in the zero-shot
setting. The introduction of BanTH not only fills a critical gap in hate speech
research for Bangla but also sets the stage for future exploration into
code-mixed and multi-label classification challenges in underrepresented
languages.
☆ SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
Attention is the cornerstone of modern Large Language Models (LLMs). Yet its
quadratic complexity limits the efficiency and scalability of LLMs, especially
for those with a long-context window. A promising approach addressing this
limitation is to leverage the sparsity in attention. However, existing
sparsity-based solutions predominantly rely on predefined patterns or
heuristics to approximate sparsity. This practice falls short of fully
capturing the dynamic nature of attention sparsity in language-based tasks.
This paper
argues that attention sparsity should be learned rather than predefined. To
this end, we design SeerAttention, a new attention mechanism that augments
conventional attention with a learnable gate that adaptively selects
significant blocks in an attention map and deems the remaining blocks sparse.
Such
block-level sparsity effectively balances accuracy and speedup. To enable
efficient learning of the gating network, we develop a customized
FlashAttention implementation that extracts the block-level ground truth of
the attention map with minimal overhead. SeerAttention not only applies to
post-training, but also excels in long-context fine-tuning. Our results show
that at post-training stages, SeerAttention significantly outperforms
state-of-the-art static or heuristic-based sparse attention methods, while also
being more versatile and flexible to adapt to varying context lengths and
sparsity ratios. When applied to long-context fine-tuning with YaRN,
SeerAttention can achieve a remarkable 90% sparsity ratio at a 32k context
length with minimal perplexity loss, offering a 5.67x speedup over
FlashAttention-2.
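To make the block-level gating idea concrete, here is a minimal sketch of a
learnable gate that pools queries and keys into block representations and
keeps only the highest-scoring blocks. It assumes a sequence length divisible
by the block size and is far simpler than SeerAttention's actual architecture.
```python
import torch
import torch.nn as nn

class BlockGate(nn.Module):
    """Toy block-level sparsity gate: scores block pairs and keeps a
    fixed fraction of them; a stand-in, not SeerAttention itself."""

    def __init__(self, dim: int, block: int = 64):
        super().__init__()
        self.block = block
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)

    def forward(self, q, k, keep_ratio: float = 0.1):
        # q, k: (seq, dim); seq assumed divisible by the block size.
        B = self.block
        qb = self.q_proj(q).unfold(0, B, B).mean(-1)  # (n_q_blocks, dim)
        kb = self.k_proj(k).unfold(0, B, B).mean(-1)  # (n_k_blocks, dim)
        scores = qb @ kb.T                            # block-level affinity
        k_keep = max(1, int(keep_ratio * scores.numel()))
        thresh = scores.flatten().topk(k_keep).values.min()
        return scores >= thresh                       # boolean block mask

gate = BlockGate(dim=128)
mask = gate(torch.randn(512, 128), torch.randn(512, 128))
print(mask.shape, mask.float().mean())  # fraction of blocks kept
```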
☆ Breaking Chains: Unraveling the Links in Multi-Hop Knowledge Unlearning
Large language models (LLMs) serve as giant information stores, often
including personal or copyrighted data, and retraining them from scratch is not
a viable option. This has led to the development of various fast, approximate
unlearning techniques to selectively remove knowledge from LLMs. Prior research
has largely focused on minimizing the probabilities of specific token sequences
by reversing the language modeling objective. However, these methods still
leave LLMs vulnerable to adversarial attacks that exploit indirect references.
In this work, we examine the limitations of current unlearning techniques in
effectively erasing a particular type of indirect prompt: multi-hop queries.
Our findings reveal that existing methods fail to completely remove multi-hop
knowledge when one of the intermediate hops is unlearned. To address this
issue, we propose MUNCH, a simple uncertainty-based approach that breaks down
multi-hop queries into subquestions and leverages the uncertainty of the
unlearned model in final decision-making. Empirical results demonstrate the
effectiveness of our framework, and MUNCH can be easily integrated with
existing unlearning techniques, making it a flexible and useful solution for
enhancing unlearning processes.
comment: 16 pages, 5 figures
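A minimal sketch of the uncertainty-based decomposition idea: answer each hop
separately, treat high answer entropy as a signal that the hop was unlearned,
and refuse rather than compose an answer. The entropy threshold, sample count,
and `llm_sample` callable are assumptions, not MUNCH's exact procedure.
```python
import math
from collections import Counter

def answer_entropy(samples: list[str]) -> float:
    """Shannon entropy of sampled answers -- a crude uncertainty proxy."""
    counts = Counter(s.strip().lower() for s in samples)
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())

def munch_style_decision(subquestions, llm_sample, n=8, max_entropy=0.5):
    """Answer each hop separately; if any hop looks uncertain, treat the
    chain as touching unlearned knowledge and refuse to answer."""
    answers = []
    for q in subquestions:
        samples = [llm_sample(q) for _ in range(n)]
        if answer_entropy(samples) > max_entropy:
            return None  # the unlearned hop surfaces as high uncertainty
        answers.append(Counter(samples).most_common(1)[0][0])
    return answers
```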
☆ Roadmap towards Superhuman Speech Understanding using Large Language Models
The success of large language models (LLMs) has prompted efforts to integrate
speech and audio data, aiming to create general foundation models capable of
processing both textual and non-textual inputs. Recent advances, such as
GPT-4o, highlight the potential of end-to-end speech LLMs, which preserve
non-semantic information and world knowledge for deeper speech understanding.
To guide the development of speech LLMs, we propose a five-level roadmap,
ranging from basic automatic speech recognition (ASR) to advanced superhuman
models capable of integrating non-semantic information with abstract acoustic
knowledge for complex tasks. Moreover, we design a benchmark, the SAGI Benchmark,
that standardizes critical aspects across various tasks in these five levels,
uncovering challenges in the use of abstract acoustic knowledge and in the
completeness of capabilities. Our findings reveal gaps in handling
paralinguistic cues and
abstract acoustic knowledge, and we offer future directions. This paper
outlines a roadmap for advancing speech LLMs, introduces a benchmark for
evaluation, and provides key insights into their current limitations and
potential.
☆ CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models
Shangda Wu, Yashan Wang, Ruibin Yuan, Zhancheng Guo, Xu Tan, Ge Zhang, Monan Zhou, Jing Chen, Xuefeng Mu, Yuejie Gao, Yuanliang Dong, Jiafeng Liu, Xiaobing Li, Feng Yu, Maosong Sun
Current music information retrieval systems face challenges in managing
linguistic diversity and integrating various musical modalities. These
limitations reduce their effectiveness in a global, multimodal music
environment. To address these issues, we introduce CLaMP 2, a system compatible
with 101 languages that supports both ABC notation (a text-based musical
notation format) and MIDI (Musical Instrument Digital Interface) for music
information retrieval. CLaMP 2, pre-trained on 1.5 million ABC-MIDI-text
triplets, includes a multilingual text encoder and a multimodal music encoder
aligned via contrastive learning. By leveraging large language models, we
obtain refined and consistent multilingual descriptions at scale, significantly
reducing textual noise and balancing language distribution. Our experiments
show that CLaMP 2 achieves state-of-the-art results in both multilingual
semantic search and music classification across modalities, thus establishing a
new standard for inclusive and global music information retrieval.
comment: 17 pages, 10 figures, 4 tables
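Contrastive alignment of the text and music encoders presumably follows the
standard symmetric InfoNCE recipe, sketched below on random embeddings;
CLaMP 2's actual loss details and temperature may differ.
```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, music_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired text/music embeddings,
    the standard recipe for contrastive cross-modal alignment."""
    t = F.normalize(text_emb, dim=-1)
    m = F.normalize(music_emb, dim=-1)
    logits = t @ m.T / temperature   # (batch, batch) similarity matrix
    labels = torch.arange(len(t))    # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.T, labels)) / 2

loss = contrastive_loss(torch.randn(16, 512), torch.randn(16, 512))
print(loss.item())
```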
☆ From Babbling to Fluency: Evaluating the Evolution of Language Models in Terms of Human Language Acquisition
We examine the language capabilities of language models (LMs) from the
critical perspective of human language acquisition. Building on classical
language development theories, we propose a three-stage framework to assess the
abilities of LMs, ranging from preliminary word understanding to complex
grammar and logical reasoning. Using this framework, we evaluate the
generative capacities of LMs using methods from linguistic research. Results
indicate that although recent LMs outperform earlier models in overall
performance, their developmental trajectory does not strictly follow the path
of human language acquisition. Notably, in generation tasks, LMs are more
similar to human performance in areas where information is easier to extract
from the corpus, such as average word length, clauses, and auxiliary verbs.
Newer LMs did not exhibit significant progress in terms of specific dimensions,
such as clauses and auxiliary verbs, where the variation across corpora is
relatively limited. Register theory offers a plausible explanation for these
observations, suggesting that the linguistic features of the training data have
a substantial impact on the models' abilities.
☆ A Systematic Investigation of Knowledge Retrieval and Selection for Retrieval Augmented Generation
Retrieval-augmented generation (RAG) has emerged as a powerful method for
enhancing natural language generation by integrating external knowledge into a
model's output. While prior work has demonstrated the importance of improving
knowledge retrieval for boosting generation quality, the role of knowledge
selection remains less clear. In this paper, we perform a comprehensive
analysis of how knowledge retrieval and selection influence downstream
generation performance in RAG systems. By simulating different retrieval and
selection conditions through a controlled mixture of gold and distractor
knowledge, we assess the impact of these factors on generation outcomes. Our
findings indicate that the downstream generator model's capability, as well as
the complexity of the task and dataset, significantly influence the impact of
knowledge retrieval and selection on the overall RAG system performance. In
typical scenarios, improving the knowledge recall score is key to enhancing
generation outcomes, with the knowledge selector providing a limited additional
benefit when a strong generator model is used on clear, well-defined tasks. For
weaker generator models or more ambiguous tasks and datasets, the knowledge F1
score becomes a critical factor, and the knowledge selector plays a more
prominent role in improving overall performance.
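Reading the abstract's recall and F1 as set overlap between selected and gold
knowledge pieces, the metrics can be computed as below on a toy mixture of
gold and distractor passages; the paper's exact definitions may differ.
```python
def knowledge_recall(selected: set, gold: set) -> float:
    return len(selected & gold) / len(gold) if gold else 0.0

def knowledge_f1(selected: set, gold: set) -> float:
    if not selected or not gold:
        return 0.0
    precision = len(selected & gold) / len(selected)
    recall = len(selected & gold) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy mixture of gold ("p*") and distractor ("d*") passage ids.
gold = {"p1", "p2"}
selected = {"p1", "d7", "d9"}
print(knowledge_recall(selected, gold), knowledge_f1(selected, gold))
```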
☆ Automatic Translation Alignment Pipeline for Multilingual Digital Editions of Literary Works
This paper investigates the application of translation alignment algorithms
in the creation of a Multilingual Digital Edition (MDE) of Alessandro Manzoni's
Italian novel "I promessi sposi" ("The Betrothed"), with translations in eight
languages (English, Spanish, French, German, Dutch, Polish, Russian and
Chinese) from the 19th and 20th centuries. We identify key requirements for the
MDE to improve both the reader experience and support for translation studies.
Our research highlights the limitations of current state-of-the-art algorithms
when applied to the translation of literary texts and outlines an automated
pipeline for MDE creation. This pipeline transforms raw texts into web-based,
side-by-side representations of original and translated texts with different
rendering options. In addition, we propose new metrics for evaluating the
alignment of literary translations and suggest visualization techniques for
future analysis.
comment: 18 pages, Computational Humanities Research Conference, December 4-6,
2024, Aarhus, Denmark
☆ Disentangling Likes and Dislikes in Personalized Generative Explainable Recommendation
Ryotaro Shimizu, Takashi Wada, Yu Wang, Johannes Kruse, Sean O'Brien, Sai HtaungKham, Linxin Song, Yuya Yoshikawa, Yuki Saito, Fugee Tsung, Masayuki Goto, Julian McAuley
Recent research on explainable recommendation generally frames the task as a
standard text generation problem, and evaluates models simply based on the
textual similarity between the predicted and ground-truth explanations.
However, this approach fails to consider one crucial aspect of the systems:
whether their outputs accurately reflect the users' (post-purchase) sentiments,
i.e., whether and why they would like and/or dislike the recommended items. To
shed light on this issue, we introduce new datasets and evaluation methods that
focus on the users' sentiments. Specifically, we construct the datasets by
explicitly extracting users' positive and negative opinions from their
post-purchase reviews using an LLM, and propose to evaluate systems based on
whether the generated explanations 1) align well with the users' sentiments,
and 2) accurately identify both positive and negative opinions of users on the
target items. We benchmark several recent models on our datasets and
demonstrate that achieving strong performance on existing metrics does not
ensure that the generated explanations align well with the users' sentiments.
Lastly, we find that existing models can provide more sentiment-aware
explanations when the users' (predicted) ratings for the target items are
directly fed into the models as input. We will release our code and datasets
upon acceptance.
☆ Atomic Calibration of LLMs in Long-Form Generations
Large language models (LLMs) often suffer from hallucinations, posing
significant challenges for real-world applications. Confidence calibration,
which estimates the underlying uncertainty of model predictions, is essential
to enhance the LLMs' trustworthiness. Existing research on LLM calibration has
primarily focused on short-form tasks, providing a single confidence score at
the response level (macro calibration). However, this approach is insufficient
for long-form generations, where responses often contain more complex
statements and may include both accurate and inaccurate information. Therefore,
we introduce atomic calibration, a novel approach that evaluates factuality
calibration at a fine-grained level by breaking down long responses into atomic
claims. We classify confidence elicitation methods into discriminative and
generative types and demonstrate that their combination can enhance
calibration. Our extensive experiments on various LLMs and datasets show that
atomic calibration is well-suited for long-form generation and can also improve
macro calibration results. Additionally, atomic calibration reveals insightful
patterns in LLM confidence throughout the generation process.
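One natural way to score calibration over atomic claims is the standard
expected calibration error (ECE), computed per claim rather than per response,
as sketched below with toy per-claim confidences and factuality labels. ECE is
one standard option; the paper's exact calibration measure may differ.
```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin claims by confidence, then average the gap between
    mean confidence and empirical accuracy, weighted by bin size."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(conf, edges[1:-1])  # bin index in [0, n_bins - 1]
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - corr[mask].mean())
    return ece

# Per-claim confidences and factuality labels for one long response.
print(expected_calibration_error([0.9, 0.8, 0.3, 0.6], [1, 1, 0, 1]))
```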
☆ Large Language Models are Easily Confused: A Quantitative Metric, Security Implications and Typological Analysis
Language Confusion is a phenomenon where Large Language Models (LLMs)
generate text that is neither in the desired language nor in a contextually
appropriate language. This phenomenon presents a critical challenge in text
generation by LLMs, often appearing as erratic and unpredictable behavior. We
hypothesize that there are linguistic regularities to this inherent
vulnerability in LLMs and shed light on patterns of language confusion across
LLMs. We introduce a novel metric, Language Confusion Entropy, designed to
directly measure and quantify this confusion, based on language distributions
informed by linguistic typology and lexical variation. Comprehensive
comparisons with the Language Confusion Benchmark (Marchisio et al., 2024)
confirm the effectiveness of our metric, revealing patterns of language
confusion across LLMs. We further link language confusion to LLM security, and
find patterns in the case of multilingual embedding inversion attacks. Our
analysis demonstrates that linguistic typology offers theoretically grounded
interpretation, and valuable insights into leveraging language similarities as
a prior for LLM alignment and security.
comment: 17 pages, 6 figures, 14 tables
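As a rough illustration only: a plain Shannon entropy over detected output
languages captures the spirit of an entropy-based confusion measure, though
the paper's metric is grounded in typology-informed language distributions
rather than this simple count.
```python
import math
from collections import Counter

def language_entropy(detected_langs: list[str]) -> float:
    """Shannon entropy of the language distribution across generated
    segments: 0 when output stays in one language, higher when it drifts."""
    counts = Counter(detected_langs)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# e.g., per-sentence language IDs from an off-the-shelf detector
print(language_entropy(["en", "en", "de", "en", "zh"]))
```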
☆ SPIN: Self-Supervised Prompt INjection
Large Language Models (LLMs) are increasingly used in a variety of important
applications, yet their safety and reliability remain as major concerns.
Various adversarial and jailbreak attacks have been proposed to bypass the
safety alignment and cause the model to produce harmful responses. We introduce
Self-supervised Prompt INjection (SPIN) which can detect and reverse these
various attacks on LLMs. As our self-supervised prompt defense is performed at
inference time, it is also compatible with existing alignment and adds an
additional layer of safety. Our benchmarks demonstrate that our
system can reduce the attack success rate by up to 87.9%, while maintaining the
performance on benign user requests. In addition, we discuss the situation of
an adaptive attacker and show that our method is still resilient against
attackers who are aware of our defense.
☆ Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation
Hyungjoo Chae, Namyoung Kim, Kai Tzu-iunn Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sunghwan Kim, Dongha Lee, Jinyoung Yeo
Large language models (LLMs) have recently gained much attention in building
autonomous agents. However, the performance of current LLM-based web agents in
long-horizon tasks is far from optimal, often yielding errors such as
repeatedly buying a non-refundable flight ticket. By contrast, humans can avoid
such an irreversible mistake, as we have an awareness of the potential outcomes
(e.g., losing money) of our actions, also known as the "world model". Motivated
by this, our study starts with preliminary analyses, confirming the
absence of world models in current LLMs (e.g., GPT-4o, Claude-3.5-Sonnet,
etc.). Then, we present a World-model-augmented (WMA) web agent, which
simulates the outcomes of its actions for better decision-making. To overcome
the challenges in training LLMs as world models predicting next observations,
such as repeated elements across observations and long HTML inputs, we propose
a transition-focused observation abstraction, where the prediction objectives
are free-form natural language descriptions exclusively highlighting important
state differences between time steps. Experiments on WebArena and Mind2Web show
that our world models improve agents' policy selection without training and
demonstrate our agents' cost- and time-efficiency compared to recent
tree-search-based agents.
comment: Work in progress
☆ Proof Flow: Preliminary Study on Generative Flow Network Language Model Tuning for Formal Reasoning
Reasoning is a fundamental substrate for solving novel and complex problems.
Deliberate efforts in learning and developing frameworks around System 2
reasoning have made great strides, yet problems of sufficient complexity remain
largely out of reach for open models. To address this gap, we examine the
potential of Generative Flow Networks as a fine-tuning method for LLMs to
unlock advanced reasoning capabilities. In this paper, we present a proof of
concept in the domain of formal reasoning, specifically in the Neural Theorem
Proving (NTP) setting, where proofs specified in a formal language such as Lean
can be deterministically and objectively verified. Unlike classical
reward-maximization reinforcement learning, which frequently over-exploits
high-reward actions and fails to effectively explore the state space, GFlowNets
have emerged as a promising approach for sampling compositional objects,
improving generalization, and enabling models to maintain diverse hypotheses.
Our early results demonstrate GFlowNet fine-tuning's potential for enhancing
model performance in a search setting, which is especially relevant given the
paradigm shift towards inference time compute scaling and "thinking slowly."
☆ CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy
Mian Zhang, Xianjun Yang, Xinlu Zhang, Travis Labrum, Jamie C. Chiu, Shaun M. Eack, Fei Fang, William Yang Wang, Zhiyu Zoey Chen
There is a significant gap between patient needs and available mental health
support today. In this paper, we aim to thoroughly examine the potential of
using Large Language Models (LLMs) to assist professional psychotherapy. To
this end, we propose a new benchmark, CBT-BENCH, for the systematic evaluation
of cognitive behavioral therapy (CBT) assistance. We include three levels of
tasks in CBT-BENCH: I: Basic CBT knowledge acquisition, with the task of
answering multiple-choice questions; II: Cognitive model understanding, with
the tasks of
cognitive distortion classification, primary core belief classification, and
fine-grained core belief classification; III: Therapeutic response generation,
with the task of generating responses to patient speech in CBT therapy
sessions. These tasks encompass key aspects of CBT that could potentially be
enhanced through AI assistance, while also outlining a hierarchy of capability
requirements, ranging from basic knowledge recitation to engaging in real
therapeutic conversations. We evaluated representative LLMs on our benchmark.
Experimental results indicate that while LLMs perform well in reciting CBT
knowledge, they fall short in complex real-world scenarios requiring deep
analysis of patients' cognitive structures and generating effective responses,
suggesting potential future work.
☆ Anchored Alignment for Self-Explanations Enhancement
In this work, we introduce a methodology for alignment designed to enhance
the ability of large language models (LLMs) to articulate their reasoning
(self-explanation) even in the absence of annotated rationale explanations. Our
alignment methodology comprises three key components: explanation quality
assessment, self-instruction dataset generation, and model alignment.
Additionally, we present a novel technique called Alignment with Anchor
Preference Pairs, which improves the selection of preference pairs by
categorizing model outputs into three groups: consistently correct,
consistently incorrect, and variable. By applying tailored strategies to each
category, we enhance the effectiveness of Direct Preference Optimization (DPO).
Our experimental results demonstrate that this approach significantly improves
explanation quality while maintaining accuracy compared to other fine-tuning
strategies.
☆ FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs
Forrest Sheng Bao, Miaoran Li, Renyi Qu, Ge Luo, Erana Wan, Yujia Tang, Weisi Fan, Manveer Singh Tamber, Suleman Kazi, Vivek Sourabh, Mike Qi, Ruixuan Tu, Chenyu Xu, Matthew Gonzales, Ofer Mendelevitch, Amin Ahmad
Summarization is one of the most common tasks performed by large language
models (LLMs), especially in applications like Retrieval-Augmented Generation
(RAG). However, existing evaluations of hallucinations in LLM-generated
summaries, and evaluations of hallucination detection models both suffer from a
lack of diversity and recency in the LLM and LLM families considered. This
paper introduces FaithBench, a summarization hallucination benchmark comprising
challenging hallucinations made by 10 modern LLMs from 8 different families,
with ground truth annotations by human experts. ``Challenging'' here means
summaries on which popular, state-of-the-art hallucination detection models,
including GPT-4o-as-a-judge, disagreed. Our results show GPT-4o and
GPT-3.5-Turbo produce the fewest hallucinations. However, even the best
hallucination detection models achieve accuracies of only around 50\% on
FaithBench, indicating substantial room for future improvement. The repo is
https://github.com/vectara/FaithBench
☆ BQA: Body Language Question Answering Dataset for Video Large Language Models
A large part of human communication relies on nonverbal cues such as facial
expressions, eye contact, and body language. Unlike language or sign language,
such nonverbal communication lacks formal rules, requiring complex reasoning
based on commonsense understanding. Enabling current Video Large Language
Models (VideoLLMs) to accurately interpret body language is a crucial
challenge, as human unconscious actions can easily cause the model to
misinterpret their intent. To address this, we propose BQA, a body language
question answering dataset, to validate whether models can correctly interpret
emotions from short video clips of body language annotated with 26 emotion
labels. We evaluated various VideoLLMs on
BQA and revealed that understanding body language is challenging, and our
analyses of the wrong answers by VideoLLMs show that certain VideoLLMs gave
significantly biased answers depending on the age group and ethnicity of the
individuals in the video. The dataset is available.
☆ Measuring Free-Form Decision-Making Inconsistency of Language Models in Military Crisis Simulations
There is an increasing interest in using language models (LMs) for automated
decision-making, with multiple countries actively testing LMs to aid in
military crisis decision-making. To scrutinize relying on LM decision-making in
high-stakes settings, we examine the inconsistency of responses in a crisis
simulation ("wargame"), similar to reported tests conducted by the US military.
Prior work illustrated escalatory tendencies and varying levels of aggression
among LMs but were constrained to simulations with pre-defined actions. This
was due to the challenges associated with quantitatively measuring semantic
differences and evaluating natural language decision-making without relying on
pre-defined actions. In this work, we query LMs for free-form responses and use
a metric based on BERTScore to measure response inconsistency quantitatively.
Leveraging the benefits of BERTScore, we show that the inconsistency metric is
robust to linguistic variations that preserve semantic meaning in a
question-answering setting across text lengths. We show that all five tested
LMs exhibit levels of inconsistency that indicate semantic differences, even
when adjusting the wargame setting, anonymizing involved conflict countries, or
adjusting the sampling temperature parameter $T$. Further qualitative
evaluation shows that models recommend courses of action that share few to no
similarities. We also study the impact of different prompt sensitivity
variations on inconsistency at temperature $T = 0$. We find that inconsistency
due to semantically equivalent prompt variations can exceed response
inconsistency from temperature sampling for most studied models across
different levels of ablations. Given the high-stakes nature of military
deployment, we recommend further consideration be taken before using LMs to
inform military decisions or other cases of high-stakes decision-making.
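One plausible reading of the metric is the mean dissimilarity (1 - BERTScore
F1) over all pairs of sampled responses, sketched below with the `bert-score`
package; the paper's exact formulation may differ, and the sample responses
here are fabricated.
```python
from itertools import combinations
from bert_score import score  # pip install bert-score

def pairwise_inconsistency(responses: list[str], lang: str = "en") -> float:
    """Mean (1 - BERTScore F1) over all response pairs -- one plausible
    way to turn BERTScore into an inconsistency measure."""
    pairs = list(combinations(responses, 2))
    cands = [a for a, _ in pairs]
    refs = [b for _, b in pairs]
    _, _, f1 = score(cands, refs, lang=lang, verbose=False)
    return float((1.0 - f1).mean())

responses = [
    "Open negotiations and de-escalate.",
    "Launch a full blockade immediately.",
    "Seek a ceasefire through back channels.",
]
print(pairwise_inconsistency(responses))
```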
☆ Meta-DiffuB: A Contextualized Sequence-to-Sequence Text Diffusion Model with Meta-Exploration
The diffusion model, a new generative modeling paradigm, has achieved
significant success in generating images, audio, video, and text. It has been
adapted for sequence-to-sequence text generation (Seq2Seq) through DiffuSeq,
termed S2S-Diffusion. Existing S2S-Diffusion models predominantly rely on fixed
or hand-crafted rules to schedule noise during the diffusion and denoising
processes. However, these models are limited by non-contextualized noise, which
fails to fully consider the characteristics of Seq2Seq tasks. In this paper, we
propose the Meta-DiffuB framework - a novel scheduler-exploiter S2S-Diffusion
paradigm designed to overcome the limitations of existing S2S-Diffusion models.
We employ Meta-Exploration to train an additional scheduler model dedicated to
scheduling contextualized noise for each sentence. Our exploiter model, an
S2S-Diffusion model, leverages the noise scheduled by our scheduler model for
updating and generation. Meta-DiffuB achieves state-of-the-art performance
compared to previous S2S-Diffusion models and fine-tuned pre-trained language
models (PLMs) across four Seq2Seq benchmark datasets. We further investigate
and visualize the impact of Meta-DiffuB's noise scheduling on the generation of
sentences with varying difficulties. Additionally, our scheduler model can
function as a "plug-and-play" model to enhance DiffuSeq without the need for
fine-tuning during the inference stage.
☆ Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation
Sreyan Ghosh, Mohammad Sadegh Rasooli, Michael Levit, Peidong Wang, Jian Xue, Dinesh Manocha, Jinyu Li
Generative Error Correction (GEC) has emerged as a powerful post-processing
method to enhance the performance of Automatic Speech Recognition (ASR)
systems. However, we show that GEC models struggle to generalize beyond the
specific types of errors encountered during training, limiting their ability to
correct new, unseen errors at test time, particularly in out-of-domain (OOD)
scenarios. This phenomenon is amplified with named entities (NEs), where, in
addition to insufficient contextual information or knowledge about the NEs,
novel NEs keep emerging. To address these issues, we propose DARAG (Data- and
Retrieval-Augmented Generative Error Correction), a novel approach designed to
improve GEC for ASR in in-domain (ID) and OOD scenarios. We augment the GEC
training dataset with synthetic data generated by prompting LLMs and
text-to-speech models, thereby simulating additional errors from which the
model can learn. For OOD scenarios, we simulate test-time errors from new
domains similarly and in an unsupervised fashion. Additionally, to better
handle named entities, we introduce retrieval-augmented correction by
augmenting the input with entities retrieved from a database. Our approach is
simple, scalable, and both domain- and language-agnostic. We experiment on
multiple datasets and settings, showing that DARAG outperforms all our
baselines, achieving 8\% -- 30\% relative WER improvements in ID and 10\% --
33\% improvements in OOD settings.
comment: Preprint. Under Review
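The retrieval-augmented correction step can be pictured as prepending
retrieved entities to the GEC input. The sketch below uses a toy entity
database and naive string similarity in place of whatever retriever DARAG
actually employs.
```python
import difflib

# Toy entity database standing in for a real NE index.
ENTITY_DB = ["Jensen Huang", "Hugging Face", "Kyoto", "Sundar Pichai"]

def retrieve_entities(hypothesis: str, k: int = 2) -> list[str]:
    """Fetch entities whose surface form is close to any hypothesis word."""
    words = hypothesis.split()
    scored = [
        (max(difflib.SequenceMatcher(None, w.lower(), e.lower()).ratio()
             for w in words), e)
        for e in ENTITY_DB
    ]
    return [e for _, e in sorted(scored, reverse=True)[:k]]

def build_gec_input(hypothesis: str) -> str:
    ents = ", ".join(retrieve_entities(hypothesis))
    return (f"Possibly relevant entities: {ents}\n"
            f"ASR hypothesis: {hypothesis}\nCorrected:")

print(build_gec_input("met jensen wang at the conference"))
```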
☆ The Geometry of Numerical Reasoning: Language Models Compare Numeric Properties in Linear Subspaces
This paper investigates whether large language models (LLMs) utilize
numerical attributes encoded in a low-dimensional subspace of the embedding
space when answering logical comparison questions (e.g., Was Cristiano born
before Messi?). We first identify these subspaces using partial least squares
regression, which effectively encodes the numerical attributes associated with
the entities in comparison prompts. Further, we demonstrate causality by
intervening in these subspaces to manipulate hidden states, thereby altering
the LLM's comparison outcomes. Experimental results show that our findings hold
for different numerical attributes, indicating that LLMs utilize the linearly
encoded information for numerical reasoning.
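The subspace-identification step can be reproduced in miniature with
scikit-learn's PLSRegression on synthetic hidden states: fit the regression,
take the leading weight vector as the numeric direction, and shift a hidden
state along it to mimic an intervention. Real experiments would of course use
actual LLM activations rather than this random-data stand-in.
```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Synthetic stand-ins: hidden states plus a numeric attribute (e.g., year).
rng = np.random.default_rng(0)
H = rng.normal(size=(200, 64))              # "hidden states"
w_true = rng.normal(size=64)
years = H @ w_true + rng.normal(scale=0.1, size=200)

pls = PLSRegression(n_components=2)
pls.fit(H, years.reshape(-1, 1))

# The learned weights span the low-dimensional "numeric" subspace.
direction = pls.x_weights_[:, 0]
direction /= np.linalg.norm(direction)

# A causal-style intervention: shift a hidden state along the subspace
# and check that the decoded attribute moves accordingly.
h = H[0]
h_shifted = h + 5.0 * direction
print(pls.predict(h.reshape(1, -1)), pls.predict(h_shifted.reshape(1, -1)))
```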
☆ Evaluating Self-Generated Documents for Enhancing Retrieval-Augmented Generation with Large Language Models
In retrieval-augmented generation systems, the integration of self-generated
documents (SGDs) alongside retrieved content has emerged as a promising
strategy for enhancing the performance of large language models. However,
previous research primarily focuses on optimizing the use of SGDs, with the
inherent properties of SGDs remaining underexplored. Therefore, this paper
conducts a comprehensive analysis of different types of SGDs and experiments on
various knowledge-intensive tasks. We develop a taxonomy of SGDs grounded in
Systemic Functional Linguistics (SFL) to compare the influence of different SGD
categories. Our findings offer key insights into what kinds of SGDs most
effectively contribute to improving LLM performance. The results and further
fusion methods based on SGD categories also provide practical guidelines for
taking better advantage of SGDs to achieve significant advancements in
knowledge-driven QA tasks with RAG.
comment: Under Review
☆ MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback
Automatic question generation (QG) is essential for AI and NLP, particularly
in intelligent tutoring, dialogue systems, and fact verification. Generating
multiple-choice questions (MCQG) for professional exams, like the United States
Medical Licensing Examination (USMLE), is particularly challenging, requiring
domain expertise and complex multi-hop reasoning for high-quality questions.
However, current large language models (LLMs) like GPT-4 struggle with
professional MCQG due to outdated knowledge, hallucination issues, and prompt
sensitivity, resulting in unsatisfactory quality and difficulty. To address
these challenges, we propose MCQG-SRefine, an LLM self-refine-based (Critique
and Correction) framework for converting medical cases into high-quality
USMLE-style questions. By integrating expert-driven prompt engineering with
iterative self-critique and self-correction feedback, MCQG-SRefine
significantly enhances human expert satisfaction regarding both the quality and
difficulty of the questions. Furthermore, we introduce an LLM-as-Judge-based
automatic metric to replace the complex and costly expert evaluation process,
ensuring reliable and expert-aligned assessments.
comment: Equal contribution for the first two authors
☆ aiXcoder-7B: A Lightweight and Effective Large Language Model for Code Completion
Siyuan Jiang, Jia Li, He Zong, Huanyu Liu, Hao Zhu, Shukai Hu, Erlu Li, Jiazheng Ding, Yu Han, Wei Ning, Ge Li
Large Language Models (LLMs) have been widely used in code completion, and
researchers are focusing on scaling up LLMs to improve their accuracy. However,
larger LLMs will increase the response time of code completion and decrease the
developers' productivity. In this paper, we propose a lightweight and effective
LLM for code completion named aiXcoder-7B. Compared to existing LLMs,
aiXcoder-7B achieves higher code completion accuracy while having smaller
scales (i.e., 7 billion parameters). We attribute the superiority of
aiXcoder-7B to three key factors: (1) Multi-objective training. We employ three
training objectives, one of which is our proposed Structured Fill-In-the-Middle
(SFIM). SFIM considers the syntax structures in code and effectively improves
the performance of LLMs for code. (2) Diverse data sampling strategies. They
consider inter-file relationships and enhance the capability of LLMs in
understanding cross-file contexts. (3) Extensive high-quality data. We
establish a rigorous data collection pipeline and consume a total of 1.2
trillion unique tokens for training aiXcoder-7B. This vast volume of data
enables aiXcoder-7B to learn a broad distribution of code. We evaluate
aiXcoder-7B on five popular code completion benchmarks and a new benchmark
collected by this paper. The results show that aiXcoder-7B outperforms the
six latest LLMs of similar size and even surpasses four larger LLMs (e.g.,
StarCoder2-15B and CodeLlama-34B), positioning aiXcoder-7B as a lightweight and
effective LLM for academia and industry. Finally, we summarize three valuable
insights for helping practitioners train the next generations of LLMs for code.
aiXcoder-7B has been open-sourced and gained significant attention. As of the
submission date, aiXcoder-7B has received 2,193 GitHub Stars.
comment: aiXcoder-7B is available at
https://github.com/aixcoder-plugin/aiXcoder-7B/tree/main
☆ Chain of Ideas: Revolutionizing Research in Novel Idea Development with LLM Agents
Long Li, Weiwen Xu, Jiayan Guo, Ruochen Zhao, Xinxuan Li, Yuqian Yuan, Boqiang Zhang, Yuming Jiang, Yifei Xin, Ronghao Dang, Deli Zhao, Yu Rong, Tian Feng, Lidong Bing
Effective research ideation is a critical step for scientific research.
However, the exponential increase in scientific literature makes it challenging
for researchers to stay current with recent advances and identify meaningful
research directions. Recent developments in large language models (LLMs)
suggest a promising avenue for automating the generation of novel research
ideas. However, existing methods for idea generation either trivially prompt
LLMs or directly expose LLMs to extensive literature without indicating useful
information. Inspired by the research process of human researchers, we propose
a Chain-of-Ideas (CoI) agent, an LLM-based agent that organizes relevant
literature in a chain structure to effectively mirror the progressive
development in a research domain. This organization facilitates LLMs to capture
the current advancements in research, thereby enhancing their ideation
capabilities. Furthermore, we propose Idea Arena, an evaluation protocol that
can comprehensively evaluate idea generation methods from different
perspectives, aligning closely with the preferences of human researchers.
Experimental results indicate that the CoI agent consistently outperforms other
methods and shows quality comparable to humans in research idea generation.
Moreover, our CoI agent is budget-friendly, with a minimum cost of \$0.50 to
generate a candidate idea and its corresponding experimental design.
comment: 10 pages, 5 figures, conference
☆ Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers
Traditional transformer models often allocate a fixed amount of computational
resources to every input token, leading to inefficient and unnecessary
computation. To address this, the Mixture of Depths (MoD) was introduced to
dynamically adjust the computational depth by skipping less important layers.
Despite its promise, current MoD approaches remain under-explored and face two
main challenges: (1) \textit{high training costs due to the need to train the
entire model along with the routers that determine which layers to skip}, and
(2) \textit{the risk of performance degradation when important layers are
bypassed}. In response to the first issue, we propose Router-Tuning, a method
that fine-tunes only the router on a small dataset, drastically reducing the
computational overhead associated with full model training. For the second
challenge, we propose MindSkip, which deploys \textit{Attention with Dynamic
Depths}. This method preserves the model's performance while significantly
enhancing computational and memory efficiency. Extensive experiments
demonstrate that our approach delivers competitive results while dramatically
improving computational efficiency, e.g., a 21\% speedup with only a 0.2\%
performance drop. The code is released at
\url{https://github.com/CASE-Lab-UMD/Router-Tuning}.
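A toy version of router-only training for dynamic depth: freeze an attention
sub-layer, attach a tiny linear router, and gate the sub-layer's output. This
is a sketch of the general idea under assumed dimensions, not MindSkip's
actual design.
```python
import torch
import torch.nn as nn

class SkippableAttention(nn.Module):
    """Toy "attention with dynamic depth": a tiny router decides per input
    whether the (frozen) attention sub-layer contributes. Only the router
    is trainable, echoing the router-only fine-tuning idea."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        for p in self.attn.parameters():
            p.requires_grad_(False)        # frozen backbone
        self.router = nn.Linear(dim, 1)    # the only trainable part

    def forward(self, x):                  # x: (batch, seq, dim)
        gate = torch.sigmoid(self.router(x.mean(dim=1)))  # (batch, 1)
        out, _ = self.attn(x, x, x)
        # Soft skip during training; at inference a hard threshold on
        # `gate` would bypass the attention computation entirely.
        return x + gate.unsqueeze(1) * out

layer = SkippableAttention(64)
y = layer(torch.randn(2, 10, 64))
print(y.shape)
```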
☆ AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning EMNLP 2024
Hao Sun, Jiayi Wu, Hengyi Cai, Xiaochi Wei, Yue Feng, Bo Wang, Shuaiqiang Wang, Yan Zhang, Dawei Yin
Recent advancements in large language models (LLMs) have been remarkable.
Users face a choice between using cloud-based LLMs for generation quality and
deploying local-based LLMs for lower computational cost. The former option is
typically costly and inefficient, while the latter usually fails to deliver
satisfactory performance for reasoning steps requiring deliberate thought
processes. In this work, we propose a novel LLM utilization paradigm that
facilitates the collaborative operation of large cloud-based LLMs and smaller
local-deployed LLMs. Our framework comprises two primary modules: the local
agent instantiated with a relatively smaller LLM, handling less complex
reasoning steps, and the cloud agent equipped with a larger LLM, managing more
intricate reasoning steps. This collaborative processing is enabled through an
adaptive mechanism where the local agent introspectively identifies errors and
proactively seeks assistance from the cloud agent, thereby effectively
integrating the strengths of both locally-deployed and cloud-based LLMs,
resulting in significant enhancements in task completion performance and
efficiency. We evaluate AdaSwitch across 7 benchmarks, ranging from
mathematical reasoning to complex question answering, using various types of
LLMs to instantiate the local and cloud agents. The empirical results show that
AdaSwitch effectively improves the performance of the local agent, and
sometimes achieves competitive results compared to the cloud agent while
incurring much less computational overhead.
comment: EMNLP 2024 Main Conference
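The adaptive switching loop reduces to: let the local agent draft, let it
introspect, and escalate to the cloud agent on a self-reported error. The
self-check prompt below is an illustrative assumption; AdaSwitch's actual
introspection mechanism is richer than this.
```python
def adaswitch_step(question: str, local_llm, cloud_llm) -> str:
    """One collaborative step, assuming both models are plain
    `prompt -> text` callables."""
    draft = local_llm(f"Answer step by step:\n{question}")
    verdict = local_llm(
        "Review your reasoning below for mistakes. "
        f"Reply OK or ERROR.\n{draft}"
    )
    if "ERROR" in verdict.upper():
        # Local agent flags its own failure and defers to the cloud agent.
        return cloud_llm(f"Answer step by step:\n{question}")
    return draft
```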
☆ EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning
In this paper, we present EH-MAM (Easy-to-Hard adaptive Masked Acoustic
Modeling), a novel self-supervised learning approach for speech representation
learning. In contrast to the prior methods that use random masking schemes for
Masked Acoustic Modeling (MAM), we introduce a novel selective and adaptive
masking strategy. Specifically, during SSL training, we progressively introduce
harder regions to the model for reconstruction. Our approach automatically
selects hard regions and is built on the observation that the reconstruction
loss of individual frames in MAM can provide natural signals to judge the
difficulty of solving the MAM pre-text task for that frame. To identify these
hard regions, we employ a teacher model that first predicts the frame-wise
losses and then decides which frames to mask. By learning to create challenging
problems, such as identifying harder frames and solving them simultaneously,
the model is able to learn more effective representations and thereby acquire a
more comprehensive understanding of speech. Quantitatively, EH-MAM
outperforms several state-of-the-art baselines across various low-resource
speech recognition and SUPERB benchmarks by 5%-10%. Additionally, we conduct a
thorough analysis to show that the regions masked by EH-MAM effectively capture
useful context across speech frames.
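The easy-to-hard selection can be sketched as masking the frames with the
highest teacher-predicted losses, ramping the proportion of hard frames over
training. The schedule, masking ratio, and random-frame mixing below are
assumptions, not the paper's settings.
```python
import torch

def select_hard_mask(pred_frame_losses: torch.Tensor, step: int,
                     total_steps: int, base_ratio: float = 0.15):
    """Easy-to-hard masking sketch: target the frames the teacher predicts
    to be hardest, more aggressively as training progresses."""
    seq_len = pred_frame_losses.numel()
    n_mask = max(1, int(base_ratio * seq_len))
    # Early in training, mix in random frames; later, pick mostly hard ones.
    hard_frac = min(1.0, step / max(1, total_steps // 2))
    n_hard = int(hard_frac * n_mask)
    hard = (pred_frame_losses.topk(n_hard).indices if n_hard
            else torch.empty(0, dtype=torch.long))
    rest = torch.randperm(seq_len)[: n_mask - n_hard]
    mask = torch.zeros(seq_len, dtype=torch.bool)
    mask[torch.cat([hard, rest])] = True
    return mask

mask = select_hard_mask(torch.rand(100), step=500, total_steps=1000)
print(mask.sum().item(), "frames masked")
```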
☆ An Evolved Universal Transformer Memory
Prior methods propose to offset the escalating costs of modern foundation
models by dropping specific parts of their contexts with hand-designed rules,
while attempting to preserve their original performance. We overcome this
trade-off with Neural Attention Memory Models (NAMMs), introducing a learned
network for memory management that improves both the performance and efficiency
of transformers. We evolve NAMMs atop pre-trained transformers to provide
different latent contexts focusing on the most relevant information for
individual layers and attention heads. NAMMs are universally applicable to any
model using self-attention as they condition exclusively on the values in the
produced attention matrices. Learning NAMMs on a small set of problems, we
achieve substantial performance improvements across multiple long-context
benchmarks while cutting the model's input contexts down to a fraction of
their original sizes. We show that the generality of our conditioning enables
zero-shot
transfer of NAMMs trained only on language to entirely new transformer
architectures even across input modalities, with their benefits carrying over
to vision and reinforcement learning.
comment: 29 pages, 14 figures. Preprint, under submission. Source code is
available at https://github.com/SakanaAI/evo-memory
☆ SLM-Mod: Small Language Models Surpass LLMs at Content Moderation
Large language models (LLMs) have shown promise in many natural language
understanding tasks, including content moderation. However, these models can be
expensive to query in real-time and do not allow for a community-specific
approach to content moderation. To address these challenges, we explore the use
of open-source small language models (SLMs) for community-specific content
moderation tasks. We fine-tune and evaluate SLMs (fewer than 15B parameters) by
comparing their performance against much larger open- and closed-sourced
models. Using 150K comments from 15 popular Reddit communities, we find that
SLMs outperform LLMs at content moderation -- 11.5% higher accuracy and 25.7%
higher recall on average across all communities. We further show the promise of
cross-community content moderation, which has implications for new communities
and the development of cross-platform moderation techniques. Finally, we
outline directions for future work on language model based content moderation.
Code and links to HuggingFace models can be found at
https://github.com/AGoyal0512/SLM-Mod.
comment: Preprint: 15 pages, 8 figures
☆ Better to Ask in English: Evaluation of Large Language Models on English, Low-resource and Cross-Lingual Settings
Large Language Models (LLMs) are trained on massive amounts of data, enabling
their application across diverse domains and tasks. Despite their remarkable
performance, most LLMs are developed and evaluated primarily in English.
Recently, a few multi-lingual LLMs have emerged, but their performance in
low-resource languages, especially the most spoken languages in South Asia, is
less explored. To address this gap, in this study, we evaluate LLMs such as
GPT-4, Llama 2, and Gemini to analyze their effectiveness in English compared
to other low-resource languages from South Asia (e.g., Bangla, Hindi, and
Urdu). Specifically, we utilized zero-shot prompting and five different prompt
settings to extensively investigate the effectiveness of the LLMs in
cross-lingual translated prompts. The findings of the study suggest that GPT-4
outperformed Llama 2 and Gemini in all five prompt settings and across all
languages. Moreover, all three LLMs performed better for English language
prompts than other low-resource language prompts. This study extensively
investigates LLMs in low-resource language contexts to highlight the
improvements required in LLMs and language-specific resources to develop more
general-purpose NLP applications.
☆ Mapping Bias in Vision Language Models: Signposts, Pitfalls, and the Road Ahead NAACL 2025
As Vision Language Models (VLMs) gain widespread use, their fairness remains
under-explored. In this paper, we analyze demographic biases across five models
and six datasets. We find that portrait datasets like UTKFace and CelebA are
the best tools for bias detection, finding gaps in performance and fairness
between LLaVa and CLIP models. However, scene-based datasets like PATA and
VLStereoSet fail to serve as useful benchmarks for bias due to their
construction. As for pronoun-based datasets like VisoGender, we receive mixed
signals, as only
some subsets of the data are useful in providing insights. To alleviate this
problem, we introduce a more difficult version of VisoGender to serve as a more
rigorous evaluation. Based on these results, we call for more effective and
carefully designed datasets to ensure VLMs are both fair and reliable.
comment: Under Review at NAACL 2025
☆ Data Defenses Against Large Language Models
Large language models excel at performing inference over text to extract
information, summarize information, or generate additional text. These
inference capabilities are implicated in a variety of ethical harms spanning
surveillance, labor displacement, and IP/copyright theft. While many policy,
legal, and technical mitigations have been proposed to counteract these harms,
these mitigations typically require cooperation from institutions that move
slower than technical advances (i.e., governments) or that have few incentives
to act to counteract these harms (i.e., the corporations that create and profit
from these LLMs). In this paper, we define and build "data defenses" -- a novel
strategy that directly empowers data owners to block LLMs from performing
inference on their data. We create data defenses by developing a method to
automatically generate adversarial prompt injections that, when added to input
text, significantly reduce the ability of LLMs to accurately infer personally
identifying information about the subject of the input text or to use
copyrighted text in inference. We examine the ethics of enabling such direct
resistance to LLM inference, and argue that making data defenses that resist
and subvert LLMs enables the realization of important values such as data
ownership, data sovereignty, and democratic control over AI systems. We verify
that our data defenses are cheap and fast to generate, work on the latest
commercial and open-source LLMs, are resistant to countermeasures, and are robust
to several different attack settings. Finally, we consider the security
implications of LLM data defenses and outline several future research
directions in this area. Our code is available at
https://github.com/wagnew3/LLMDataDefenses and a tool for using our defenses to
protect text against LLM inference is at
https://wagnew3.github.io/LLM-Data-Defenses/.
☆ Retrieval-Enhanced Named Entity Recognition
When combined with In-Context Learning, a technique that enables models to
adapt to new tasks by incorporating task-specific examples or demonstrations
directly within the input prompt, autoregressive language models have achieved
good performance in a wide range of tasks and applications. However, this
combination has not been properly explored in the context of named entity
recognition, where the structure of this task poses unique challenges. We
propose RENER (Retrieval-Enhanced Named Entity Recognition), a technique for
named entity recognition using autoregressive language models based on
In-Context Learning and information retrieval techniques. When presented with
an input text, RENER fetches similar examples from a dataset of training
examples, which are used to help a language model recognize named entities in
the input text. RENER is modular and independent of the underlying
language model and information retrieval algorithms. Experimental results show
that in the CrossNER collection we achieve state-of-the-art performance with
the proposed technique and that information retrieval can increase the F-score
by up to 11 percentage points.
comment: 13 pages, 6 figures, 3 tables
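RENER-style prompt assembly can be sketched as retrieving the most similar
training sentences and prepending them as in-context demonstrations. The
string-similarity retriever and label format below are stand-ins for whatever
retriever and schema a real system would use.
```python
from difflib import SequenceMatcher

def rener_prompt(text: str, train_set: list[tuple[str, str]],
                 k: int = 3) -> str:
    """Retrieve the k training sentences most similar to the input and
    prepend them as in-context NER demonstrations."""
    ranked = sorted(
        train_set,
        key=lambda ex: SequenceMatcher(None, text, ex[0]).ratio(),
        reverse=True,
    )
    demos = "\n".join(f"Text: {s}\nEntities: {e}" for s, e in ranked[:k])
    return f"{demos}\nText: {text}\nEntities:"

train = [("Bob works at Acme.", "Bob|PER, Acme|ORG"),
         ("Alice visited Paris.", "Alice|PER, Paris|LOC")]
print(rener_prompt("Carol visited Berlin.", train))
```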
☆ Learning to Summarize from LLM-generated Feedback
Developing effective text summarizers remains a challenge due to issues like
hallucinations, key information omissions, and verbosity in LLM-generated
summaries. This work explores using LLM-generated feedback to improve summary
quality by aligning the summaries with human preferences for faithfulness,
completeness, and conciseness. We introduce FeedSum, a large-scale dataset
containing multi-dimensional LLM feedback on summaries of varying quality
across diverse domains. Our experiments show how feedback quality,
dimensionality, and granularity influence preference learning, revealing that
high-quality, multi-dimensional, fine-grained feedback significantly improves
summary generation. We also compare two methods for using this feedback:
supervised fine-tuning and direct preference optimization. Finally, we
introduce SummLlama3-8b, a model that outperforms the nearly 10x larger
Llama3-70b-instruct in generating human-preferred summaries, demonstrating that
smaller models can achieve superior performance with appropriate training. The
full dataset will be released soon. The SummLlama3-8B model is now available at
https://huggingface.co/DISLab/SummLlama3-8B.
☆ Controllable Generation via Locally Constrained Resampling
Autoregressive models have demonstrated an unprecedented ability at modeling
the intricacies of natural language. However, they continue to struggle with
generating complex outputs that adhere to logical constraints. Sampling from a
fully-independent distribution subject to a constraint is hard. Sampling from
an autoregressive distribution subject to a constraint is doubly hard: We have
to contend not only with the hardness of the constraint but also the
distribution's lack of structure. We propose a tractable probabilistic approach
that performs Bayesian conditioning to draw samples subject to a constraint.
Our approach considers the entire sequence, leading to a more globally optimal
constrained generation than current greedy methods. Starting from a model
sample, we induce a local, factorized distribution which we can tractably
condition on the constraint. To generate samples that satisfy the constraint,
we sample from the conditional distribution, correct for biases in the samples
and resample. The resulting samples closely approximate the target distribution
and are guaranteed to satisfy the constraints. We evaluate our approach on
several tasks, including LLM detoxification and solving Sudoku puzzles. We show
that by disallowing a list of toxic expressions, our approach is able to steer
the model's outputs away from toxic generations, outperforming similar
approaches to detoxification. We conclude by showing that our approach achieves
perfect accuracy on Sudoku, compared to <50% for GPT-4o and Gemini 1.5.
comment: arXiv admin note: text overlap with arXiv:2312.03905
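A much-simplified cousin of this procedure is plain rejection sampling: draw
candidates from the model and keep only those satisfying the constraint, so
the survivors approximate the conditioned distribution. The paper's method is
far more sample-efficient, using a local factorized proposal with bias
correction, but the sketch below conveys the conditioning target.
```python
import random

def constrained_resample(sample_fn, satisfies, n=64):
    """Rejection-sampling baseline: samples drawn from the model and kept
    only when the constraint holds approximate the model conditioned on
    that constraint."""
    valid = [c for c in (sample_fn() for _ in range(n)) if satisfies(c)]
    return random.choice(valid) if valid else None

# Toy demo: sample 3-digit strings, constrain them to be even numbers.
digits = "0123456789"
print(constrained_resample(
    lambda: "".join(random.choice(digits) for _ in range(3)),
    lambda s: int(s) % 2 == 0,
))
```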
☆ A Little Human Data Goes A Long Way
Faced with an expensive human annotation process, creators of NLP systems
increasingly turn to synthetic data generation. While this method shows
promise, the extent to which synthetic data can replace human annotation is
poorly understood. We investigate the use of synthetic data in Fact
Verification (FV) and Question Answering (QA) by studying the effects of
incrementally replacing human generated data with synthetic points on eight
diverse datasets. Strikingly, replacing up to 90% of the training data only
marginally decreases performance, but replacing the final 10% leads to severe
declines. We find that models trained on purely synthetic data can be reliably
improved by including as few as 125 human-generated data points. We show that
matching the performance gain of just a little additional human data (only 200
points) requires an order of magnitude more synthetic data and estimate price
ratios at which human annotation would be a more cost-effective solution. Our
results suggest that even when human annotation at scale is infeasible, there
is great value in having a small proportion of the dataset be human-generated.
♻ ☆ Towards Multilingual LLM Evaluation for European Languages
Klaudia Thellmann, Bernhard Stadler, Michael Fromm, Jasper Schulze Buschhoff, Alex Jude, Fabio Barth, Johannes Leveling, Nicolas Flores-Herr, Joachim Köhler, René Jäkel, Mehdi Ali
The rise of Large Language Models (LLMs) has revolutionized natural language
processing across numerous languages and tasks. However, evaluating LLM
performance in a consistent and meaningful way across multiple European
languages remains challenging, especially due to the scarcity of
language-parallel multilingual benchmarks. We introduce a multilingual
evaluation approach tailored for European languages. We employ translated
versions of five widely-used benchmarks to assess the capabilities of 40 LLMs
across 21 European languages. Our contributions include examining the
effectiveness of translated benchmarks, assessing the impact of different
translation services, and offering a multilingual evaluation framework for LLMs
that includes newly created datasets: EU20-MMLU, EU20-HellaSwag, EU20-ARC,
EU20-TruthfulQA, and EU20-GSM8K. The benchmarks and results are made publicly
available to encourage further research in multilingual LLM evaluation.
♻ ☆ Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach EMNLP 2024
Retrieval Augmented Generation (RAG) has been a powerful tool for Large
Language Models (LLMs) to efficiently process overly lengthy contexts. However,
recent LLMs like Gemini-1.5 and GPT-4 show exceptional capabilities to
understand long contexts directly. We conduct a comprehensive comparison
between RAG and long-context (LC) LLMs, aiming to leverage the strengths of
both. We benchmark RAG and LC across various public datasets using three latest
LLMs. Results reveal that when resourced sufficiently, LC consistently
outperforms RAG in terms of average performance. However, RAG's significantly
lower cost remains a distinct advantage. Based on this observation, we propose
Self-Route, a simple yet effective method that routes queries to RAG or LC
based on model self-reflection. Self-Route significantly reduces the
computation cost while maintaining a comparable performance to LC. Our findings
provide a guideline for long-context applications of LLMs using RAG and LC.
comment: Accepted to EMNLP 2024 industry track
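Self-Route reduces to a two-step decision, sketched below with an assumed
`llm(prompt) -> str` callable and illustrative prompt wording: try RAG first,
and fall back to the full long context only when the model declares the
retrieved chunks insufficient.
```python
def self_route(query: str, retrieved: str, llm, full_context: str) -> str:
    """Route to cheap RAG or expensive long-context based on the model's
    own judgment of whether the retrieved chunks suffice."""
    rag_reply = llm(
        f"Context:\n{retrieved}\n\nQuestion: {query}\n"
        "If the context is insufficient, reply exactly UNANSWERABLE."
    )
    if "UNANSWERABLE" in rag_reply:
        # Self-reflection says RAG failed: pay for the full long context.
        return llm(f"Context:\n{full_context}\n\nQuestion: {query}")
    return rag_reply
```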
♻ ☆ Many-Shot In-Context Learning NeurIPS
Rishabh Agarwal, Avi Singh, Lei M. Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, Biao Zhang, Ankesh Anand, Zaheer Abbas, Azade Nova, John D. Co-Reyes, Eric Chu, Feryal Behbahani, Aleksandra Faust, Hugo Larochelle
Large language models (LLMs) excel at few-shot in-context learning (ICL) --
learning from a few examples provided in context at inference, without any
weight updates. Newly expanded context windows allow us to investigate ICL with
hundreds or thousands of examples -- the many-shot regime. Going from few-shot
to many-shot, we observe significant performance gains across a wide variety of
generative and discriminative tasks. While promising, many-shot ICL can be
bottlenecked by the available amount of human-generated examples. To mitigate
this limitation, we explore two new settings: Reinforced and Unsupervised ICL.
Reinforced ICL uses model-generated chain-of-thought rationales in place of
human examples. Unsupervised ICL removes rationales from the prompt altogether,
and prompts the model only with domain-specific questions. We find that both
Reinforced and Unsupervised ICL can be quite effective in the many-shot regime,
particularly on complex reasoning tasks. Finally, we demonstrate that, unlike
few-shot learning, many-shot learning is effective at overriding pretraining
biases, can learn high-dimensional functions with numerical inputs, and
performs comparably to fine-tuning. We also find that inference cost increases
linearly in the many-shot regime, and frontier LLMs benefit from many-shot ICL
to varying degrees. Our analysis also reveals the limitations of next-token
prediction loss as an indicator of downstream ICL performance.
comment: NeurIPS (Spotlight)
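Reinforced ICL can be sketched as: sample a rationale per problem, keep only
rationales whose final answer matches the known label, and pack the survivors
into a many-shot prompt. The prompt format and filtering rule below are
illustrative assumptions, not the paper's exact recipe.
```python
def reinforced_icl_prompt(problems, answers, llm, shots=100):
    """Build a many-shot prompt from model-generated rationales, keeping
    only those that reach the known final answer."""
    examples = []
    for q, a in zip(problems, answers):
        rationale = llm(
            f"Q: {q}\nThink step by step, end with 'Answer: <x>'."
        )
        if rationale.strip().endswith(f"Answer: {a}"):
            examples.append(f"Q: {q}\n{rationale}")
        if len(examples) == shots:
            break
    return "\n\n".join(examples)
```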
♻ ☆ Dynamic Topic Language Model on Heterogeneous Children's Mental Health Clinical Notes
Mental health disorders affect children's lives and well-being and have
received increased attention since the COVID-19 pandemic. Analyzing psychiatric
clinical notes with topic models is critical to evaluating children's mental
status over time. However, few topic models are built for longitudinal
settings, and most existing approaches fail to capture temporal trajectories
for each document. To address these challenges, we develop a dynamic topic
model with consistent topics and individualized temporal dependencies on the
evolving document metadata. Our model preserves the semantic meaning of
discovered topics over time and incorporates heterogeneity among documents. In
particular, when documents can be categorized, we propose a classifier-free
approach to maximize topic heterogeneity across different document groups. We
also present an efficient variational optimization procedure adapted for the
multistage longitudinal setting. In this case study, we apply our method to the
psychiatric clinical notes from a large tertiary pediatric hospital in Southern
California and achieve a 38% increase in the overall coherence of extracted
topics. Our real data analysis reveals that children tend to express more
negative emotions during state shutdowns and more positive emotions when
schools reopen.
Furthermore, it suggests that sexual and gender minority (SGM) children display
more pronounced reactions to major COVID-19 events and a greater sensitivity to
vaccine-related news than non-SGM children. This study examines children's
mental health progression during the pandemic and offers clinicians valuable
insights to recognize disparities in children's mental health related to their
sexual and gender identities.
♻ ☆ The Impact of Visual Information in Chinese Characters: Evaluating Large Models' Ability to Recognize and Utilize Radicals
The glyphic writing system of Chinese incorporates information-rich visual
features in each character, such as radicals that provide hints about meaning
or pronunciation. However, there has been no investigation into whether
contemporary Large Language Models (LLMs) and Vision-Language Models (VLMs) can
harness these sub-character features in Chinese through prompting. In this
study, we establish a benchmark to evaluate LLMs' and VLMs' understanding of
visual elements in Chinese characters, including radicals, composition
structures, strokes, and stroke counts. Our results reveal that models
surprisingly exhibit some, but still limited, knowledge of the visual
information, regardless of whether images of characters are provided. To elicit
models' ability to use radicals, we further experiment with incorporating
radicals into the prompts for Chinese language processing (CLP) tasks. We
observe consistent improvement in Part-Of-Speech tagging when providing
additional information about radicals, suggesting the potential to enhance CLP
by integrating sub-character information.
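A minimal sketch of that prompt augmentation, with a three-entry radical table standing in for a real dictionary (e.g., one derived from the Unihan database):

    # Sketch: prepend radical hints to a Chinese POS-tagging prompt.
    RADICALS = {"妈": "女", "河": "氵", "枫": "木"}  # hypothetical mini-table

    def pos_prompt_with_radicals(sentence):
        hints = "; ".join(f"{ch} has radical {RADICALS[ch]}"
                          for ch in sentence if ch in RADICALS)
        return (f"Radical hints: {hints}\n"
                "Tag each word in the sentence with its part of speech.\n"
                f"Sentence: {sentence}\nTags:")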
♻ ☆ Superlatives in Context: Modeling the Implicit Semantics of Superlatives
Superlatives are used to single out elements with a maximal/minimal property.
Semantically, superlatives perform a set comparison: something (or some things)
has the min/max property out of a set. As such, superlatives provide an ideal
test bed for studying implicit phenomena and discourse restrictions. While
this comparison set is often not explicitly defined, its (implicit)
restrictions can be inferred from the discourse context the expression appears
in. In this work we provide an extensive computational study on the semantics
of superlatives. We propose a unified account of superlative semantics which
allows us to derive a broad-coverage annotation schema. Using this unified
schema we annotated a multi-domain dataset of superlatives and their semantic
interpretations. We specifically focus on interpreting implicit or ambiguous
superlative expressions, by analyzing how the discourse context restricts the
set of interpretations. In a set of experiments we then analyze how well models
perform at variations of predicting superlative semantics, with and without
context. We show that the fine-grained semantics of superlatives in context can
be challenging for contemporary models, including GPT-4.
comment: 11 pages
♻ ☆ Larger Language Models Don't Care How You Think: Why Chain-of-Thought Prompting Fails in Subjective Tasks
In-Context Learning (ICL) in Large Language Models (LLMs) has emerged as the
dominant technique for performing natural language tasks, as it does not
require updating the model parameters with gradient-based methods. ICL promises
to "adapt" the LLM to perform the present task at a competitive or
state-of-the-art level at a fraction of the computational cost. ICL can be
augmented by incorporating the reasoning process to arrive at the final label
explicitly in the prompt, a technique called Chain-of-Thought (CoT) prompting.
However, recent work has found that ICL relies mostly on the retrieval of task
priors and less so on "learning" to perform tasks, especially for complex
subjective domains like emotion and morality, where priors ossify posterior
predictions. In this work, we examine whether "enabling" reasoning also creates
the same behavior in LLMs, wherein the format of CoT retrieves reasoning priors
that remain relatively unchanged despite the evidence in the prompt. We find
that, surprisingly, CoT indeed suffers from the same posterior collapse as ICL
for larger language models. Code is available at
https://github.com/gchochla/cot-priors.
comment: 5 pages, 2 figures, 1 table. arXiv admin note: text overlap with
arXiv:2403.17125
♻ ☆ Natural Language Processing Methods for the Study of Protein-Ligand Interactions
Recent advances in Natural Language Processing (NLP) have ignited interest in
developing effective methods for predicting protein-ligand interactions (PLIs)
given their relevance to drug discovery and protein engineering efforts and the
ever-growing volume of biochemical sequence and structural data available. The
parallels between human languages and the "languages" used to represent
proteins and ligands have enabled the use of NLP machine learning approaches to
advance PLI studies. In this review, we explain where and how such approaches
have been applied in the recent literature and discuss useful mechanisms such
as long short-term memory, transformers, and attention. We conclude with a
discussion of the current limitations of NLP methods for the study of PLIs as
well as key challenges that need to be addressed in future work.
comment: 52 pages, 3 figures
♻ ☆ Can Large Language Models Generate High-quality Patent Claims?
Large language models (LLMs) have shown exceptional performance across
various text generation tasks but remain under-explored in the patent domain,
which offers highly structured and precise language. This paper constructs a
dataset to investigate the performance of current LLMs in patent claim
generation. Our results demonstrate that generating claims based on patent
descriptions outperforms previous research relying on abstracts. Interestingly,
current patent-specific LLMs perform much worse than state-of-the-art general
LLMs, highlighting the necessity for future research on in-domain LLMs. We also
find that LLMs can produce high-quality first independent claims, but their
performance markedly decreases for subsequent dependent claims. Moreover,
fine-tuning can enhance the completeness of inventions' features, conceptual
clarity, and feature linkage. Among the tested LLMs, GPT-4 demonstrates the
best performance in comprehensive human evaluations by patent experts, with
better feature coverage, conceptual clarity, and technical coherence. Despite
these capabilities, comprehensive revision and modification are still necessary
to pass rigorous patent scrutiny and ensure legal robustness.
comment: 16 pages, 2 figures, 12 tables
♻ ☆ Modeling Human Subjectivity in LLMs Using Explicit and Implicit Human Factors in Personas EMNLP 2024
Salvatore Giorgi, Tingting Liu, Ankit Aich, Kelsey Isman, Garrick Sherman, Zachary Fried, João Sedoc, Lyle H. Ungar, Brenda Curtis
Large language models (LLMs) are increasingly being used in human-centered
social scientific tasks, such as data annotation, synthetic data creation, and
engaging in dialog. However, these tasks are highly subjective and dependent on
human factors, such as one's environment, attitudes, beliefs, and lived
experiences. Thus, it may be the case that employing LLMs (which do not have
such human factors) in these tasks results in a lack of variation in data,
failing to reflect the diversity of human experiences. In this paper, we
examine the role of prompting LLMs with human-like personas and asking the
models to answer as if they were a specific human. This is done explicitly,
with exact demographics, political beliefs, and lived experiences, or
implicitly via names prevalent in specific populations. The LLM personas are
then evaluated via (1) a subjective annotation task (e.g., detecting toxicity)
and (2) a belief generation task, where both tasks are known to vary across
human factors. We examine the impact of explicit vs. implicit personas and
investigate which human factors LLMs recognize and respond to. Results show
that explicit LLM personas show mixed results when reproducing known human
biases, but generally fail to demonstrate implicit biases. We conclude that
LLMs may capture the statistical patterns of how people speak, but are
generally unable to model the complex interactions and subtleties of human
perceptions, potentially limiting their effectiveness in social science
applications.
comment: Accepted at Findings of EMNLP 2024
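The explicit/implicit contrast can be sketched as two prompt templates; the wording below is an assumption for illustration, not the paper's exact prompts:

    # Sketch of explicit vs. implicit persona prompts (wording assumed).
    def explicit_persona(demographics, beliefs, task):
        return (f"You are a person with this profile: {demographics}. "
                f"Your beliefs and experiences: {beliefs}.\n{task}")

    def implicit_persona(name, task):
        # Implicit: only a name prevalent in a given population is supplied.
        return f"You are {name}.\n{task}"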
♻ ☆ uDistil-Whisper: Label-Free Data Filtering for Knowledge Distillation in Low-Data Regimes
Recent work on distilling Whisper's knowledge into small models using
pseudo-labels shows promising performance while reducing the size by up to
50%. This results in small, efficient, and dedicated models. However, a
critical step of distillation from pseudo-labels involves filtering
high-quality predictions and using only those during training. This step
requires ground-truth labels to compare against and filter out low-quality
examples, making the whole process supervised. In addition, the distillation
process requires a large amount of data, thereby limiting the ability to
distill models
in low-resource settings. To address this challenge, we propose a distillation
framework that does not require any labeled data. Through experimentation, we
show that our best distilled models outperform the teacher model by 5-7 points
in terms of WER compared to those without filtering and are on par with or
perform better than similar supervised data filtering setups. When we scale the
data, our models significantly outperform all zero-shot and supervised models.
We demonstrate that it is possible to distill large Whisper models into
relatively small ones without using any labeled data. Our distilled models are
also 25-50% more compute- and memory-efficient while maintaining performance
equal to or better than that of the teacher model.
comment: Work in progress
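The abstract does not spell out the label-free filtering criterion, so the sketch below uses one plausible unsupervised proxy: keep an utterance only when two independent teacher decodings (e.g., greedy vs. beam search) agree closely, which requires no ground-truth transcripts:

    # Plausible label-free filter: agreement between two teacher decodings.
    from jiwer import wer  # pip install jiwer

    def filter_pseudo_labels(greedy_hyps, beam_hyps, max_disagreement=0.1):
        kept = []
        for g, b in zip(greedy_hyps, beam_hyps):
            # Low mutual WER suggests a confident, likely-clean pseudo-label.
            if wer(g, b) <= max_disagreement:
                kept.append(g)
        return kept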
♻ ☆ K-Level Reasoning: Establishing Higher Order Beliefs in Large Language Models for Strategic Reasoning
Strategic reasoning is a complex yet essential capability for intelligent
agents. It requires Large Language Model (LLM) agents to adapt their strategies
dynamically in multi-agent environments. Unlike static reasoning tasks, success
in these contexts depends on anticipating other agents' beliefs and actions
while continuously adjusting strategies to achieve individual goals. LLMs and
LLM agents often struggle with strategic reasoning due to the absence of a
reasoning framework that enables them to dynamically infer others' perspectives
and adapt to changing environments. Inspired by the Level-K framework from game
theory and behavioral economics, which extends reasoning from simple reactions
to structured strategic depth, we propose a novel framework: "K-Level Reasoning
with Large Language Models (K-R)." This framework employs recursive mechanisms
to enable LLMs to achieve varying levels of strategic depth, allowing agents to
form higher order beliefs - beliefs about others' beliefs. We validate this
framework through rigorous testing on four testbeds: two classical game theory
problems and two social intelligence tasks. The results demonstrate the
advantages of K-R in strategic reasoning. Our work presents the first recursive
implementation of strategic depth in large language models (LLMs). It
establishes a foundation for future research into theory of mind and strategic
reasoning in LLMs.
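The recursive core of level-k reasoning can be sketched in a few lines; the prompts and the llm callable are assumptions for illustration:

    # Minimal sketch of level-k recursion with an LLM.
    def k_level_act(llm, k, state, me, opponent):
        if k == 0:
            return llm(f"As {me}, given the situation: {state}, "
                       "choose an action.")
        # Form a belief about the opponent by reasoning at level k-1 ...
        predicted = k_level_act(llm, k - 1, state, opponent, me)
        # ... then best-respond to that predicted action.
        return llm(f"As {me}, given the situation: {state}, the opponent "
                   f"will likely do: {predicted}. Choose your best response.")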
♻ ☆ Beyond Coarse-Grained Matching in Video-Text Retrieval ACCV 2024
Video-text retrieval has seen significant advancements, yet the ability of
models to discern subtle differences in captions still requires verification.
In this paper, we introduce a new approach for fine-grained evaluation. Our
approach can be applied to existing datasets by automatically generating hard
negative test captions with subtle single-word variations across nouns, verbs,
adjectives, adverbs, and prepositions. We perform comprehensive experiments
using four state-of-the-art models across two standard benchmarks (MSR-VTT and
VATEX) and two specially curated datasets enriched with detailed descriptions
(VLN-UVO and VLN-OOPS), resulting in a number of novel insights: 1) our
analyses show that the current evaluation benchmarks fall short in detecting a
model's ability to perceive subtle single-word differences, 2) our fine-grained
evaluation highlights the difficulty models face in distinguishing such subtle
variations. To enhance fine-grained understanding, we propose a new baseline
that can be easily combined with current methods. Experiments on our
fine-grained evaluations demonstrate that this approach enhances a model's
ability to understand fine-grained differences.
comment: Accepted to ACCV 2024
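The hard-negative construction can be sketched as single-word edits over a substitution table; the tiny table here is a hypothetical stand-in, whereas the paper generates such variants automatically across nouns, verbs, adjectives, adverbs, and prepositions:

    # Sketch: single-word hard negatives from a substitution table.
    SWAPS = {"man": "woman", "opens": "closes", "red": "blue",
             "quickly": "slowly", "into": "out of"}

    def hard_negatives(caption):
        words = caption.split()
        variants = []
        for i, w in enumerate(words):
            if w in SWAPS:  # vary exactly one word at a time
                variants.append(" ".join(words[:i] + [SWAPS[w]] + words[i+1:]))
        return variants

    # hard_negatives("a man opens a red door")
    # -> ["a woman opens a red door", "a man closes a red door",
    #     "a man opens a blue door"]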
♻ ☆ Understanding and Mitigating Language Confusion in LLMs EMNLP 2024
We investigate a surprising limitation of LLMs: their inability to
consistently generate text in a user's desired language. We create the Language
Confusion Benchmark (LCB) to evaluate such failures, covering 15 typologically
diverse languages with existing and newly-created English and multilingual
prompts. We evaluate a range of LLMs on monolingual and cross-lingual
generation reflecting practical use cases, finding that Llama Instruct and
Mistral models exhibit high degrees of language confusion and even the
strongest models fail to consistently respond in the correct language. We
observe that base and English-centric instruct models are more prone to
language confusion, which is aggravated by complex prompts and high sampling
temperatures. We find that language confusion can be partially mitigated via
few-shot prompting, multilingual SFT and preference tuning. We release our
language confusion benchmark, which serves as a first layer of efficient,
scalable multilingual evaluation at
https://github.com/for-ai/language-confusion.
comment: EMNLP 2024 Main Conference Camera-ready
♻ ☆ ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models EMNLP 2024
Yash Akhauri, Ahmed F AbouElhamayed, Jordan Dotzel, Zhiru Zhang, Alexander M Rush, Safeen Huda, Mohamed S Abdelfattah
The high power consumption and latency-sensitive deployments of large
language models (LLMs) have motivated efficiency techniques like quantization
and sparsity. Contextual sparsity, where the sparsity pattern is
input-dependent, is crucial in LLMs because the permanent removal of attention
heads or neurons from LLMs can significantly degrade accuracy. Prior work has
attempted to model contextual sparsity using neural networks trained to predict
activation magnitudes, which can be used to dynamically prune structures with
low predicted activation magnitude. In this paper, we look beyond
magnitude-based pruning criteria to assess attention head and neuron importance
in LLMs. We develop a novel predictor called ShadowLLM, which can shadow the
LLM behavior and enforce better sparsity patterns, resulting in over 15%
improvement in end-to-end accuracy compared to prior methods. In addition,
ShadowLLM achieves up to a 20% speed-up over the state-of-the-art DejaVu
framework. These enhancements are validated on Llama-2 and OPT models with up
to 30 billion parameters. Our code is available at
https://github.com/abdelfattah-lab/shadow_llm/.
comment: Accepted to EMNLP 2024 (Main, Long Paper)
♻ ☆ Block-Attention for Efficient RAG
We introduce Block-Attention, an attention mechanism designed to address the
increased inference latency and cost in Retrieval-Augmented Generation (RAG)
scenarios. Traditional approaches often encode the entire context. Instead,
Block-Attention divides retrieved documents into discrete blocks, with each
block independently calculating key-value (KV) states except for the final
block. In RAG scenarios, by defining each passage as a block, Block-Attention
enables us to reuse the KV states of passages that have been seen before,
thereby significantly reducing the latency and the computation overhead during
inference. The implementation of Block-Attention involves block segmentation,
position re-encoding, and fine-tuning the LLM to adapt to the Block-Attention
mechanism. Experiments on four RAG benchmarks demonstrate that after block
fine-tuning, the Block-Attention model achieves performance comparable to
self-attention models (68.4% vs 67.9% on Llama3) or even superior performance
(62.8% vs 59.6% on Mistral). Notably, Block-Attention significantly reduces
the time to first token (TTFT) and floating point operations (FLOPs) to a very
low level. It only takes 45 ms to output the first token for an input sequence
with a total length of 32K. Compared to the self-attention models, the time
consumption and corresponding FLOPs are reduced by 98.7% and 99.8%,
respectively.
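A high-level sketch of the KV-reuse logic follows; encode_kv, shift_positions, and count_tokens are hypothetical helpers, and a real integration also changes the attention mask and fine-tunes the model as described above:

    # Sketch of blockwise KV reuse across requests.
    kv_cache = {}  # passage text -> position-independent KV states

    def blockwise_prefill(blocks, encode_kv, shift_positions, count_tokens):
        past, offset = [], 0
        for block in blocks[:-1]:
            if block not in kv_cache:
                # Each block is encoded independently, as if at position 0,
                # so its KV states are reusable whenever it reappears.
                kv_cache[block] = encode_kv(block)
            past.append(shift_positions(kv_cache[block], offset))
            offset += count_tokens(block)
        # Only the final block (the user query) attends to all prior KV
        # states and must always be computed fresh.
        return past, blocks[-1]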
♻ ☆ Prompt-SAW: Leveraging Relation-Aware Graphs for Textual Prompt Compression
Muhammad Asif Ali, Zhengping Li, Shu Yang, Keyuan Cheng, Yang Cao, Tianhao Huang, Guimin Hu, Weimin Lyu, Lijie Hu, Lu Yu, Di Wang
Large Language Models (LLMs) have shown exceptional abilities for multiple
different natural language processing tasks. While prompting is a crucial tool
for LLM inference, we observe that there is a significant cost associated with
exceedingly lengthy prompts. Existing attempts to compress lengthy prompts lead
to substandard results in terms of readability/interpretability of the
compressed prompt, with a detrimental impact on prompt utility. To address
this, we propose Prompt-SAW: Prompt compresSion via Relation AWare graphs, an
effective strategy for prompt compression over task-agnostic and task-aware
prompts. Prompt-SAW uses the prompt's textual information to build a graph and
later extracts key information elements in the graph to come up with the
compressed prompt. We also propose GSM8K-aug, i.e., an extended version of the
existing GSM8K benchmark for task-agnostic prompts in order to provide a
comprehensive evaluation platform. Experimental evaluation using benchmark
datasets shows that prompts compressed by Prompt-SAW are not only better in
terms of readability, but they also outperform the best-performing baseline
models by up to 10.1 and 77.1, respectively, for task-agnostic and task-aware
settings while compressing the original prompt text by 34.9 and 56.7.
comment: 16 pages
♻ ☆ A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences EMNLP 2024
The reasoning abilities of Large Language Models (LLMs) are becoming a
central focus of study in NLP. In this paper, we consider the case of
syllogistic reasoning, an area of deductive reasoning studied extensively in
logic and cognitive psychology. Previous research has shown that pre-trained
LLMs exhibit reasoning biases, such as $\textit{content effects}$, avoid
answering that $\textit{no conclusion follows}$, display human-like
difficulties, and struggle with multi-step reasoning. We contribute to this
research line by systematically investigating the effects of chain-of-thought
reasoning, in-context learning (ICL), and supervised fine-tuning (SFT) on
syllogistic reasoning, considering syllogisms with conclusions that support or
violate world knowledge, as well as ones with multiple premises. Crucially, we
go beyond the standard focus on accuracy, with an in-depth analysis of the
conclusions generated by the models. Our results suggest that the behavior of
pre-trained LLMs can be explained by heuristics studied in cognitive science
and that both ICL and SFT improve model performance on valid inferences,
although only the latter mitigates most reasoning biases without harming model
consistency.
comment: Accepted to EMNLP 2024 (main conference)
♻ ☆ Unmasking Database Vulnerabilities: Zero-Knowledge Schema Inference Attacks in Text-to-SQL Systems
Text-to-SQL systems empower users to interact with databases using natural
language, automatically translating queries into executable SQL code. However,
their reliance on database schema information for SQL generation exposes them
to significant security vulnerabilities, particularly schema inference attacks
that can lead to unauthorized data access or manipulation. In this paper, we
introduce a novel zero-knowledge framework for reconstructing the underlying
database schema of text-to-SQL models without any prior knowledge of the
database. Our approach systematically probes text-to-SQL models with specially
crafted questions and leverages a surrogate GPT-4 model to interpret the
outputs, effectively uncovering hidden schema elements -- including tables,
columns, and data types. We demonstrate that our method achieves high accuracy
in reconstructing table names, with F1 scores of up to 0.99 for generative
models and 0.78 for fine-tuned models, underscoring the severity of schema
leakage risks. Furthermore, we propose a simple protection mechanism for
generative models and empirically show its limitations in mitigating these
attacks.
♻ ☆ BLT: Can Large Language Models Handle Basic Legal Text?
We find that the best publicly available LLMs like GPT-4 and Claude currently
perform poorly on basic legal text handling. This motivates the creation of a
benchmark consisting of examples that lawyers and paralegals would expect LLMs
to handle zero-shot, such as looking up the text at a line of a witness
deposition or at a subsection of a contract. LLMs' poor performance on this
benchmark casts into doubt their reliability as-is for legal practice. However,
fine-tuning on our training set brings even a small model to near-perfect
performance. This benchmark will be useful for fine-tuning LLMs for downstream
legal tasks, as well as for tracking LLMs' reliability as-is for basic legal
tasks.
♻ ☆ Towards Inducing Document-Level Abilities in Standard Multilingual Neural Machine Translation Models
Neural Machine Translation (NMT) models have traditionally used Sinusoidal
Positional Embeddings (PEs), which often struggle to capture long-range
dependencies and are less efficient for handling extended context or
document-level translation tasks. This work addresses the challenge of
transitioning pre-trained NMT models from absolute sinusoidal PEs to relative
PEs, such as Rotary Positional Embeddings (ROPE) and Attention with Linear
Biases (ALIBI), without compromising performance. We demonstrate that
parameter-efficient fine-tuning, using only a small amount of high-quality
data, can successfully facilitate this transition. Experimental results
indicate that switching from sinusoidal to relative PEs results in competitive
translation quality on sentence-level evaluation benchmarks. Additionally,
models trained with ROPE consistently outperform those using ALIBI and
Sinusoidal PEs on document-level benchmarks across both string-based metrics
and qualitative evaluations. Moreover, we find that a small amount of
long-context data in a few languages is sufficient for cross-lingual length
generalization, thereby inducing long-context capabilities.
comment: Under Review
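For reference, ROPE rotates each pair of feature dimensions by an angle proportional to the token position, so attention scores depend only on relative offsets; a minimal NumPy sketch of the standard formulation:

    # Reference sketch of Rotary Positional Embeddings (ROPE).
    import numpy as np

    def rope(x, base=10000.0):
        # x: (seq_len, dim) with dim even
        seq_len, dim = x.shape
        inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)   # (dim/2,)
        angles = np.outer(np.arange(seq_len), inv_freq)         # (seq, dim/2)
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[:, 0::2], x[:, 1::2]
        out = np.empty_like(x)
        out[:, 0::2] = x1 * cos - x2 * sin   # rotate each (even, odd) pair
        out[:, 1::2] = x1 * sin + x2 * cos
        return out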
♻ ☆ Granular Privacy Control for Geolocation with Vision Language Models EMNLP 2024
Vision Language Models (VLMs) are rapidly advancing in their capability to
answer information-seeking questions. As these models are widely deployed in
consumer applications, they could lead to new privacy risks due to emergent
abilities to identify people in photos, geolocate images, etc. As we
demonstrate, somewhat surprisingly, current open-source and proprietary VLMs
are very capable image geolocators, making widespread geolocation with VLMs an
immediate privacy risk, rather than merely a theoretical future concern. As a
first step to address this challenge, we develop a new benchmark, GPTGeoChat,
to test the ability of VLMs to moderate geolocation dialogues with users. We
collect a set of 1,000 image geolocation conversations between in-house
annotators and GPT-4v, which are annotated with the granularity of location
information revealed at each turn. Using this new dataset, we evaluate the
ability of various VLMs to moderate GPT-4v geolocation conversations by
determining when too much location information has been revealed. We find that
custom fine-tuned models perform on par with prompted API-based models when
identifying leaked location information at the country or city level; however,
fine-tuning on supervised data appears to be needed to accurately moderate
finer granularities, such as the name of a restaurant or building.
comment: Accepted to EMNLP 2024 main conference
♻ ☆ Human and LLM Biases in Hate Speech Annotations: A Socio-Demographic Analysis of Annotators and Targets
The rise of online platforms exacerbated the spread of hate speech, demanding
scalable and effective detection. However, the accuracy of hate speech
detection systems heavily relies on human-labeled data, which is inherently
susceptible to biases. While previous work has examined the issue, the
interplay between the characteristics of the annotator and those of the target
of the hate are still unexplored. We fill this gap by leveraging an extensive
dataset with rich socio-demographic information of both annotators and targets,
uncovering how human biases manifest in relation to the target's attributes.
Our analysis surfaces the presence of widespread biases, which we
quantitatively describe and characterize based on their intensity and
prevalence, revealing marked differences. Furthermore, we compare human biases
with those exhibited by persona-based LLMs. Our findings indicate that while
persona-based LLMs do exhibit biases, these differ significantly from those of
human annotators. Overall, our work offers new and nuanced results on human
biases in hate speech annotations, as well as fresh insights into the design of
AI-driven hate speech detection systems.
♻ ☆ Efficient In-Domain Question Answering for Resource-Constrained Environments
Retrieval Augmented Generation (RAG) is a common method for integrating
external knowledge into pretrained Large Language Models (LLMs) to enhance
accuracy and relevancy in question answering (QA) tasks. However, prompt
engineering and resource efficiency remain significant bottlenecks in
developing optimal and robust RAG solutions for real-world QA applications.
Recent studies have shown success in using fine tuning to address these
problems; in particular, Retrieval Augmented Fine Tuning (RAFT) applied to
smaller 7B models has demonstrated superior performance compared to RAG setups
with much larger models such as GPT-3.5. The combination of RAFT with
parameter-efficient fine tuning (PEFT) techniques, such as Low-Rank Adaptation
(LoRA), promises an even more efficient solution, yet remains an unexplored
area. In this work, we combine RAFT with LoRA to reduce fine tuning and storage
requirements and gain faster inference times while maintaining comparable RAG
performance. This results in a more compute-efficient RAFT, or CRAFT, which is
particularly useful for knowledge-intensive QA tasks in resource-constrained
environments where internet access may be restricted and hardware resources
limited.
comment: 6 pages, 2 tables
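The PEFT side of this recipe can be sketched with the Hugging Face peft library; the rank, alpha, and target modules below are illustrative choices, not the paper's exact configuration:

    # Sketch: wrap a 7B model with LoRA adapters before RAFT-style tuning.
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # typically <1% of parameters train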
♻ ☆ LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding ACL 2024
Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu
We present LayerSkip, an end-to-end solution to speed-up inference of large
language models (LLMs). First, during training we apply layer dropout, with low
dropout rates for earlier layers and higher dropout rates for later layers, and
an early exit loss where all transformer layers share the same exit. Second,
during inference, we show that this training recipe increases the accuracy of
early exit at earlier layers, without adding any auxiliary layers or modules to
the model. Third, we present a novel self-speculative decoding solution where
we exit at early layers and verify and correct with remaining layers of the
model. Our proposed self-speculative decoding approach has less memory
footprint than other speculative decoding approaches and benefits from shared
compute and activations of the draft and verification stages. We run
experiments on different Llama model sizes on different types of training:
pretraining from scratch, continual pretraining, finetuning on specific data
domain, and finetuning on specific task. We implement our inference solution
and show speedups of up to 2.16x on summarization for CNN/DM documents, 1.82x
on coding, and 2.0x on TOPv2 semantic parsing task. We open source our code and
checkpoints at https://github.com/facebookresearch/LayerSkip.
comment: ACL 2024
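A simplified sketch of the self-speculative decoding loop; early_exit_lm and full_lm are hypothetical next-token callables sharing one set of weights, with full_lm scoring all drafted positions in a single pass:

    def self_speculative_decode(early_exit_lm, full_lm, tokens, n_draft=4,
                                max_new=128):
        out = list(tokens)
        while len(out) - len(tokens) < max_new:
            # Draft: cheap greedy tokens from an early-exit forward pass.
            draft = []
            for _ in range(n_draft):
                draft.append(early_exit_lm(out + draft))
            # Verify: one full-model pass over the drafted span, then keep
            # the longest prefix on which both models agree.
            verified = full_lm(out, draft)   # full-model token per position
            accepted = 0
            while accepted < n_draft and verified[accepted] == draft[accepted]:
                accepted += 1
            if accepted < n_draft:
                # Even on rejection we gain the full model's own token.
                out += draft[:accepted] + [verified[accepted]]
            else:
                out += draft
        return out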
♻ ☆ Building Better: Avoiding Pitfalls in Developing Language Resources when Data is Scarce
Language is a symbolic capital that affects people's lives in many ways
(Bourdieu, 1977, 1991). It is a powerful tool that accounts for identities,
cultures, traditions, and societies in general. Hence, data in a given language
should be viewed as more than a collection of tokens. Good data collection and
labeling practices are key to building more human-centered and socially aware
technologies. While there has been a rising interest in mid- to low-resource
languages within the NLP community, work in this space has to overcome unique
challenges such as data scarcity and access to suitable annotators. In this
paper, we collect feedback from those directly involved in and impacted by NLP
artefacts for mid- to low-resource languages. We conduct a quantitative and
qualitative analysis of the responses and highlight the main issues related to
(1) data quality such as linguistic and cultural data suitability; and (2) the
ethics of common annotation practices such as the misuse of online community
services. Based on these findings, we make several recommendations for the
creation of high-quality language artefacts that reflect the cultural milieu of
its speakers, while simultaneously respecting the dignity and labor of data
workers.
♻ ☆ LLM-based Cognitive Models of Students with Misconceptions
Accurately modeling student cognition is crucial for developing effective
AI-driven educational technologies. A key challenge is creating realistic
student models that satisfy two essential properties: (1) accurately
replicating specific misconceptions, and (2) correctly solving problems where
these misconceptions are not applicable. This dual requirement reflects the
complex nature of student understanding, where misconceptions coexist with
correct knowledge. This paper investigates whether Large Language Models (LLMs)
can be instruction-tuned to meet this dual requirement and effectively simulate
student thinking in algebra. We introduce MalAlgoPy, a novel Python library
that generates datasets reflecting authentic student solution patterns through
a graph-based representation of algebraic problem-solving. Utilizing MalAlgoPy,
we define and examine Cognitive Student Models (CSMs) - LLMs instruction tuned
to faithfully emulate realistic student behavior. Our findings reveal that LLMs
trained on misconception examples can efficiently learn to replicate errors.
However, the training diminishes the model's ability to solve problems
correctly, particularly for problem types where the misconceptions are not
applicable, thus failing to satisfy the second property of CSMs. We demonstrate
that by carefully calibrating the ratio of correct to misconception examples in
the training data - sometimes as low as 0.25 - it is possible to develop CSMs
that satisfy both properties. Our insights enhance our understanding of
AI-based student models and pave the way for effective adaptive learning
systems.
♻ ☆ MuJo: Multimodal Joint Feature Space Learning for Human Activity Recognition
Stefan Gerd Fritsch, Cennet Oguz, Vitor Fortes Rey, Lala Ray, Maximilian Kiefer-Emmanouilidis, Paul Lukowicz
Human Activity Recognition (HAR) is a longstanding problem in AI with
applications in a broad range of areas, including healthcare, sports and
fitness, security, and more. The performance of HAR in real-world settings is
strongly dependent on the type and quality of the input signal that can be
acquired. Given an unobstructed, high-quality camera view of a scene, computer
vision systems, in particular in conjunction with foundation models, can today
fairly reliably distinguish complex activities. On the other hand, recognition
using modalities such as wearable sensors (which are often more broadly
available, e.g., in mobile phones and smartwatches) is a more difficult
problem, as the signals often contain less information and labeled training
data is more difficult to acquire. To alleviate the need for labeled data, we
introduce our comprehensive Fitness Multimodal Activity Dataset (FiMAD) in this
work, which can be used with the proposed pre-training method MuJo (Multimodal
Joint Feature Space Learning) to enhance HAR performance across various
modalities. FiMAD was created using YouTube fitness videos and contains
parallel video, language, pose, and simulated IMU sensor data. MuJo utilizes
this dataset to learn a joint feature space for these modalities. We show that
classifiers pre-trained on FiMAD can increase the performance on real HAR
datasets such as MM-Fit, MyoGym, MotionSense, and MHEALTH. For instance, on
MM-Fit, we achieve a Macro F1-Score of up to 0.855 when fine-tuning on only 2%
of the training data and 0.942 when utilizing the full training set for
classification tasks. We compare our approach to other self-supervised
methods and show that, unlike them, ours consistently improves on the baseline
network performance and provides better data efficiency.
♻ ☆ Learning to Ask Informative Questions: Enhancing LLMs with Preference Optimization and Expected Information Gain EMNLP 2024
Questions are essential tools for acquiring the necessary information to
complete information-seeking tasks. However, large language models (LLMs),
especially open-source models, often perform poorly in generating informative
questions, as measured by expected information gain (EIG). In this paper, we
propose a method to enhance the informativeness of LLM-generated questions in
20-question game dialogues. We sample multiple questions from the same model
(LLAMA 2-CHAT 7B) for each game and create pairs of low-EIG and high-EIG
questions to apply a Direct Preference Optimization (DPO) algorithm. Our
results show that this method produces more effective questions (in terms of
EIG), even in domains different from those used to train the DPO model.
comment: Accepted to EMNLP 2024 (Findings)
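For a yes/no question over a candidate set with a uniform prior, EIG reduces to the binary entropy of the split the question induces (prior entropy log2 n minus the expected posterior entropy); a worked sketch:

    import math

    def eig(candidates, yes_set):
        # yes_set: the candidates for which the question's answer is "yes".
        n, k = len(candidates), len(yes_set)
        if k in (0, n):
            return 0.0  # the question never splits the set: no information
        p = k / n
        # Prior entropy minus expected posterior entropy simplifies, under
        # a uniform prior, to the binary entropy of the induced split.
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    # eig(range(16), set(range(8))) == 1.0: an even split gains one full bit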
♻ ☆ Relay Decoding: Concatenating Large Language Models for Machine Translation
Leveraging large language models for machine translation has demonstrated
promising results. However, it does require the large language models to
possess the capability of handling both the source and target languages in
machine translation. When it is challenging to find large models that support
the desired languages, resorting to continuous learning methods becomes a
costly endeavor. To mitigate these expenses, we propose an innovative approach
called RD (Relay Decoding), which entails concatenating two distinct large
models that individually support the source and target languages. By
incorporating a simple mapping layer to facilitate the connection between these
two models and utilizing a limited amount of parallel data for training, we
successfully achieve superior results in the machine translation task.
Experimental results conducted on the Multi30k and WikiMatrix datasets validate
the effectiveness of our proposed method.
comment: Work in progress
♻ ☆ On the Reliability of Large Language Models to Misinformed and Demographically-Informed Prompts AAAI
Toluwani Aremu, Oluwakemi Akinwehinmi, Chukwuemeka Nwagu, Syed Ishtiaque Ahmed, Rita Orji, Pedro Arnau Del Amo, Abdulmotaleb El Saddik
We investigate and observe the behaviour and performance of Large Language
Model (LLM)-backed chatbots in addressing misinformed prompts and questions
with demographic information within the domains of Climate Change and Mental
Health. Through a combination of quantitative and qualitative methods, we
assess the chatbots' ability to discern the veracity of statements, their
adherence to facts, and the presence of bias or misinformation in their
responses. Our quantitative analysis using True/False questions reveals that
these chatbots can be relied on to give the right answers to these close-ended
questions. However, the qualitative insights, gathered from domain experts,
shows that there are still concerns regarding privacy, ethical implications,
and the necessity for chatbots to direct users to professional services. We
conclude that while these chatbots hold significant promise, their deployment
in sensitive areas necessitates careful consideration, ethical oversight, and
rigorous refinement to ensure they serve as a beneficial augmentation to human
expertise rather than an autonomous solution.
comment: Study conducted between August and December 2023. Under review at
AAAI-AI Magazine. Submitted for archival purposes only
♻ ☆ Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation
Katherine M. Collins, Najoung Kim, Yonatan Bitton, Verena Rieser, Shayegan Omidshafiei, Yushi Hu, Sherol Chen, Senjuti Dutta, Minsuk Chang, Kimin Lee, Youwei Liang, Georgina Evans, Sahil Singla, Gang Li, Adrian Weller, Junfeng He, Deepak Ramachandran, Krishnamurthy Dj Dvijotham
Human feedback plays a critical role in learning and refining reward models
for text-to-image generation, but the optimal form the feedback should take for
learning an accurate reward function has not been conclusively established.
This paper investigates the effectiveness of fine-grained feedback which
captures nuanced distinctions in image quality and prompt-alignment, compared
to traditional coarse-grained feedback (for example, thumbs up/down or ranking
between a set of options). While fine-grained feedback holds promise,
particularly for systems catering to diverse societal preferences, we show that
demonstrating its superiority to coarse-grained feedback is not automatic.
Through experiments on real and synthetic preference data, we surface the
complexities of building effective models due to the interplay of model choice,
feedback type, and the alignment between human judgment and computational
interpretation. We identify key challenges in eliciting and utilizing
fine-grained feedback, prompting a reassessment of its assumed benefits and
practicality. Our findings -- e.g., that fine-grained feedback can lead to
worse models for a fixed budget, in some settings; however, in controlled
settings with known attributes, fine-grained rewards can indeed be more helpful
-- call for careful consideration of feedback attributes and potentially beckon
novel modeling approaches to appropriately unlock the potential value of
fine-grained feedback in-the-wild.
♻ ☆ InferAct: Inferring Safe Actions for LLM-Based Agents Through Preemptive Evaluation and Human Feedback
A crucial requirement for deploying LLM-based agents in real-life
applications is the robustness against risky or even irreversible mistakes.
However, the existing research lacks a focus on preemptive evaluation of
reasoning trajectories performed by LLM agents, leading to a gap in ensuring
safe and reliable operations. To explore better solutions, this paper
introduces InferAct, a novel approach that leverages the belief reasoning
ability of LLMs, grounded in Theory-of-Mind, to proactively detect potential
errors before risky actions are executed (e.g., `buy-now' in automatic online
trading or web shopping). InferAct acts as a human proxy, detecting unsafe
actions and alerting users for intervention, which helps prevent irreversible
risks in time and enhances the actor agent's decision-making process.
Experiments on three widely-used tasks demonstrate the effectiveness of
InferAct, presenting a novel solution for safely developing LLM agents in
environments involving critical decision-making.
♻ ☆ Pyramid-Driven Alignment: Pyramid Principle Guided Integration of Large Language Models and Knowledge Graphs
Large Language Models (LLMs) possess impressive reasoning abilities but are
prone to generating incorrect information, often referred to as hallucinations.
While incorporating external Knowledge Graphs (KGs) can partially mitigate this
issue, existing methods primarily treat KGs as static knowledge repositories,
overlooking the critical disparity between KG and LLM knowledge, and failing to
fully exploit the reasoning capabilities inherent in KGs. To address these
limitations, we propose Pyramid-Driven Alignment (PDA), a novel framework for
seamlessly integrating LLMs with KGs. PDA utilizes Pyramid Principle analysis
to construct a hierarchical pyramid structure. This structure is designed to
reflect the input question and generate more validated deductive knowledge,
thereby enhancing the alignment of LLMs and KGs and ensuring more cohesive
integration. Furthermore, PDA employs a recursive mechanism to harness the
underlying reasoning abilities of KGs, resulting in more accurate knowledge
retrieval for question-answering tasks. Our experimental results reveal a
substantial performance advantage of PDA over state-of-the-art baselines, with
improvements reaching 26.70% and 26.78%.
♻ ☆ Autonomous Agents for Collaborative Task under Information Asymmetry NeurIPS 2024
Wei Liu, Chenxi Wang, Yifei Wang, Zihao Xie, Rennai Qiu, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Chen Qian
Large Language Model Multi-Agent Systems (LLM-MAS) have achieved great
progress in solving complex tasks. Agents within such systems communicate to
collaboratively solve tasks, under the premise of shared
information. However, when agents' collaborations are leveraged to perform
multi-person tasks, a new challenge arises due to information asymmetry, since
each agent can only access the information of its human user. Previous MAS
struggle to complete tasks under this condition. To address this, we propose a
new MAS paradigm termed iAgents, which denotes Informative Multi-Agent Systems.
In iAgents, the human social network is mirrored in the agent network, where
agents proactively exchange human information necessary for task resolution,
thereby overcoming information asymmetry. iAgents employs a novel agent
reasoning mechanism, InfoNav, to navigate agents' communication toward
effective information exchange. Together with InfoNav, iAgents organizes human
information in a mixed memory to provide agents with accurate and comprehensive
information for exchange. Additionally, we introduce InformativeBench, the
first benchmark tailored for evaluating LLM agents' task-solving ability under
information asymmetry. Experimental results show that iAgents can collaborate
within a social network of 140 individuals and 588 relationships, autonomously
communicate over 30 turns, and retrieve information from nearly 70,000 messages
to complete tasks within 3 minutes.
comment: 32 pages, 12 figures, 6 tables, accepted by NeurIPS 2024, see detail
at https://thinkwee.top/iagents
♻ ☆ MedAide: Towards an Omni Medical Aide via Specialized LLM-based Multi-Agent Collaboration
Jinjie Wei, Dingkang Yang, Yanshu Li, Qingyao Xu, Zhaoyu Chen, Mingcheng Li, Yue Jiang, Xiaolu Hou, Lihua Zhang
Large Language Model (LLM)-driven interactive systems currently show
promise in healthcare domains. Despite their remarkable capabilities,
LLMs typically lack personalized recommendations and diagnosis analysis in
sophisticated medical applications, causing hallucinations and performance
bottlenecks. To address these challenges, this paper proposes MedAide, an
LLM-based omni medical multi-agent collaboration framework for specialized
healthcare services. Specifically, MedAide first performs query rewriting
through retrieval-augmented generation to accomplish accurate medical intent
understanding. Next, we devise a contextual encoder to obtain intent
prototype embeddings, which are used to recognize fine-grained intents by
similarity matching. According to the intent relevance, the activated agents
collaborate effectively to provide integrated decision analysis. Extensive
experiments are conducted on four medical benchmarks with composite intents.
Experimental results from automated metrics and expert doctor evaluations show
that MedAide outperforms current LLMs and improves their medical proficiency
and strategic reasoning.
comment: LLM-based Multi-Agent Collaboration for Medical Applications
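The intent-recognition step can be sketched as cosine similarity against prototype embeddings; the encoder, threshold, and multi-intent activation below are assumptions for illustration:

    import numpy as np

    def recognize_intents(query_vec, prototypes, threshold=0.5):
        # prototypes: dict mapping intent name -> prototype embedding
        names = list(prototypes)
        mat = np.stack([prototypes[n] for n in names])
        sims = mat @ query_vec / (np.linalg.norm(mat, axis=1)
                                  * np.linalg.norm(query_vec) + 1e-9)
        # Composite queries may activate several agents at once.
        return [n for n, s in zip(names, sims) if s >= threshold]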
♻ ☆ Skeleton: A New Framework for Accelerating Language Models via Task Neuron Localized Prompt Tuning
Prompt tuning methods have shown comparable performance to general training
methods as parameter-efficient fine-tuning (PEFT) methods in various natural
language understanding tasks. However, existing prompt tuning methods still
utilize the entire model architecture even when solving a specific task, which
prevents them from accelerating inference speed during the application
procedure. In this paper, we propose a novel prompt tuning framework called
Skeleton to efficiently utilize a language model in terms of memory and time
complexity for solving various tasks, retaining only task-relevant neurons by
using an explainability method. From our framework, we can efficiently solve
various tasks by using only task-relevant neurons and prepending adequate
task-specific prompt tokens with only a single language model. Experiments
reveal that our method significantly enhances inference efficiency (up to a
1.73x speed-up) on various widely used benchmarks, showing comparable
performance to the prompt tuning method. Moreover, our method is applicable
across various transformer-based architectures, confirming its practicality and
scalability.
comment: 11 pages
♻ ☆ LLoCO: Learning Long Contexts Offline EMNLP 2024
Sijun Tan, Xiuyu Li, Shishir Patil, Ziyang Wu, Tianjun Zhang, Kurt Keutzer, Joseph E. Gonzalez, Raluca Ada Popa
Processing long contexts remains a challenge for large language models (LLMs)
due to the quadratic computational and memory overhead of the self-attention
mechanism and the substantial KV cache sizes during generation. We propose
LLoCO, a novel approach to address this problem by learning contexts offline
through context compression and in-domain parameter-efficient finetuning with
LoRA. Our method enables an LLM to create a concise representation of the
original context and efficiently retrieve relevant information to answer
questions accurately. Our approach extends the effective context window of a 4k
token LLaMA2-7B model to handle up to 128k tokens. We evaluate our approach on
several long-context question-answering datasets, demonstrating that LLoCO
significantly outperforms in-context learning while using $30\times$ fewer
tokens during inference. LLoCO achieves up to $7.62\times$ speed-up during
inference and $11.52\times$ higher throughput during finetuning, substantially
reducing the cost of long-document question answering. This makes it a promising
solution for efficient long context processing. Our code is publicly available
on https://github.com/jeffreysijuntan/lloco.
comment: EMNLP 2024. The first two authors contributed equally to this work
♻ ☆ Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions EMNLP2024
Classification is a core NLP task with many potential
applications. While large language models (LLMs) have brought substantial
advancements in text generation, their potential for enhancing classification
tasks remains underexplored. To address this gap, we propose a framework for
thoroughly investigating fine-tuning LLMs for classification, including both
generation- and encoding-based approaches. We instantiate this framework in
edit intent classification (EIC), a challenging and underexplored
classification task. Our extensive experiments and systematic comparisons with
various training approaches and a representative selection of LLMs yield new
insights into their application for EIC. We investigate the generalizability of
these findings on five further classification tasks. To demonstrate the
proposed methods and address the data shortage for empirical edit analysis, we
use our best-performing EIC model to create Re3-Sci2.0, a new large-scale
dataset of 1,780 scientific document revisions with over 94k labeled edits. The
quality of the dataset is assessed through human evaluation. The new dataset
enables an in-depth empirical study of human editing behavior in academic
writing. We make our experimental framework, models and data publicly
available.
comment: EMNLP2024 Main
♻ ☆ From Measurement Instruments to Data: Leveraging Theory-Driven Synthetic Training Data for Classifying Social Constructs
Computational text classification is a challenging task, especially for
multi-dimensional social constructs. Recently, there has been increasing
discussion that synthetic training data could enhance classification by
offering examples of how these constructs are represented in texts. In this
paper, we systematically examine the potential of theory-driven synthetic
training data for improving the measurement of social constructs. In
particular, we explore how researchers can transfer established knowledge from
measurement instruments in the social sciences, such as survey scales or
annotation codebooks, into theory-driven generation of synthetic data. Using
two studies on measuring sexism and political topics, we assess the added value
of synthetic training data for fine-tuning text classification models. Although
the results of the sexism study were less promising, our findings demonstrate
that synthetic data can be highly effective in reducing the need for labeled
data in political topic classification. With only a minimal drop in
performance, synthetic data allows for substituting large amounts of labeled
data. Furthermore, theory-driven synthetic data performed markedly better than
data generated without conceptual information in mind.
♻ ☆ Pragmatic Competence Evaluation of Large Language Models for the Korean Language
Benchmarks play a significant role in the current evaluation of Large
Language Models (LLMs), yet they often overlook the models' abilities to
capture the nuances of human language, primarily focusing on evaluating
embedded knowledge and technical skills. To address this gap, our study
evaluates how well LLMs understand context-dependent expressions from a
pragmatic standpoint, specifically in Korean. We use both Multiple-Choice
Questions (MCQs) for automatic evaluation and Open-Ended Questions (OEQs)
assessed by human experts. Our results show that GPT-4 leads with scores of
81.11 in MCQs and 85.69 in OEQs, closely followed by HyperCLOVA X.
Additionally, while few-shot learning generally improves performance,
Chain-of-Thought (CoT) prompting tends to encourage literal interpretations,
which may limit effective pragmatic inference. Our findings highlight the need
for LLMs to better understand and generate language that reflects human
communicative norms.
comment: 38th Pacific Asia Conference on Language, Information and Computation
♻ ☆ LightPAL: Lightweight Passage Retrieval for Open Domain Multi-Document Summarization
Open-Domain Multi-Document Summarization (ODMDS) is the task of generating
summaries from large document collections in response to user queries. This
task is crucial for efficiently addressing diverse information needs from
users. Traditional retrieve-then-summarize approaches fall short for open-ended
queries in ODMDS tasks. These queries often require broader context than
initially retrieved passages provide, making it challenging to retrieve all
relevant information in a single search. While iterative retrieval methods have
been explored for multi-hop question answering (MQA), they are impractical for
ODMDS due to the high latency of repeated LLM inference. Accordingly, we propose
LightPAL, a lightweight passage retrieval method for ODMDS. LightPAL leverages
an LLM to pre-construct a graph representing passage relationships, then
employs random walk during retrieval, avoiding iterative LLM inference.
Experiments demonstrate that LightPAL outperforms naive sparse and pre-trained
dense retrievers in both retrieval and summarization metrics, while achieving
higher efficiency compared to iterative MQA approaches.
comment: 15 pages, 7 figures, 6 tables
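The retrieval step can be sketched as a personalized-PageRank-style random walk over the pre-built passage graph; the row-stochastic adjacency matrix A and the seed scores are assumptions standing in for the LLM-judged relations and first-pass retrieval:

    import numpy as np

    def random_walk_scores(A, seed, restart=0.15, iters=50):
        # A: row-stochastic passage-relation matrix (built offline with an
        # LLM); seed: initial retrieval scores for each passage.
        p = seed / seed.sum()
        s = p.copy()
        for _ in range(iters):
            s = (1 - restart) * (A.T @ s) + restart * p  # walk with restart
        return s  # rank passages by visit probability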
♻ ☆ SafeGen: Mitigating Sexually Explicit Content Generation in Text-to-Image Models
Text-to-image (T2I) models, such as Stable Diffusion, have exhibited
remarkable performance in generating high-quality images from text descriptions
in recent years. However, text-to-image models may be tricked into generating
not-safe-for-work (NSFW) content, particularly in sexually explicit scenarios.
Existing countermeasures mostly focus on filtering inappropriate inputs and
outputs, or suppressing improper text embeddings, which can block sexually
explicit content (e.g., naked) but may still be vulnerable to adversarial
prompts -- inputs that appear innocent but are ill-intended. In this paper, we
present SafeGen, a framework to mitigate sexual content generation by
text-to-image models in a text-agnostic manner. The key idea is to eliminate
explicit visual representations from the model regardless of the text input. In
this way, the text-to-image model is resistant to adversarial prompts since
such unsafe visual representations are obstructed from within. Extensive
experiments conducted on four datasets and large-scale user studies demonstrate
SafeGen's effectiveness in mitigating sexually explicit content generation
while preserving the high-fidelity of benign images. SafeGen outperforms eight
state-of-the-art baseline methods and achieves 99.4% sexual content removal
performance. Furthermore, our constructed benchmark of adversarial prompts
provides a basis for future development and evaluation of anti-NSFW-generation
methods.
comment: Accepted by ACM CCS 2024. Please cite this paper as "Xinfeng Li,
Yuchen Yang, Jiangyi Deng, Chen Yan, Yanjiao Chen, Xiaoyu Ji, Wenyuan Xu.
SafeGen: Mitigating Sexually Explicit Content Generation in Text-to-Image
Models. In Proceedings of ACM Conference on Computer and Communications
Security (CCS), 2024."
♻ ☆ SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation
Zeyao Ma, Bohan Zhang, Jing Zhang, Jifan Yu, Xiaokang Zhang, Xiaohan Zhang, Sijia Luo, Xi Wang, Jie Tang
We introduce SpreadsheetBench, a challenging spreadsheet manipulation
benchmark exclusively derived from real-world scenarios, designed to immerse
current large language models (LLMs) in the actual workflow of spreadsheet
users. Unlike existing benchmarks that rely on synthesized queries and
simplified spreadsheet files, SpreadsheetBench is built from 912 real questions
gathered from online Excel forums, which reflect the intricate needs of users.
The associated spreadsheets from the forums contain a variety of tabular data
such as multiple tables, non-standard relational tables, and abundant
non-textual elements. Furthermore, we propose a more reliable evaluation metric
akin to online judge platforms, where multiple spreadsheet files are created as
test cases for each instruction, ensuring the evaluation of robust solutions
capable of handling spreadsheets with varying values. Our comprehensive
evaluation of various LLMs under both single-round and multi-round inference
settings reveals a substantial gap between the state-of-the-art (SOTA) models
and human performance, highlighting the benchmark's difficulty.
comment: NeurIPS 2024 (Spotlight); Homepage:
https://spreadsheetbench.github.io/
♻ ☆ Beyond Instruction Following: Evaluating Inferential Rule Following of Large Language Models
Wangtao Sun, Chenxiang Zhang, XueYou Zhang, Xuanqing Yu, Ziyang Huang, Pei Chen, Haotian Xu, Shizhu He, Jun Zhao, Kang Liu
Although Large Language Models (LLMs) have demonstrated strong abilities, they
are further supposed to be controlled and guided by rules in real-world
scenarios to be safe, accurate, and intelligent. This demands that LLMs possess
inferential rule-following capability. However, no prior work has made a clear
evaluation of the inferential rule-following capability of LLMs. Previous
studies that try to evaluate the
inferential rule-following capability of LLMs fail to distinguish the
inferential rule-following scenarios from the instruction-following scenarios.
Therefore, this paper first clarifies the concept of inferential rule-following
and proposes a comprehensive benchmark, RuleBench, to evaluate a diversified
range of inferential rule-following abilities. Our experimental results on a
variety of LLMs show that they are still limited in following rules. Our
analysis based on the evaluation results provides insights into the
improvements for LLMs toward a better inferential rule-following intelligent
agent. We further propose Inferential Rule-Following Tuning (IRFT). The
experimental results show that through IRFT, LLMs can learn abstract
rule-following abilities from purely synthetic data and then generalize to
RuleBench. The data and code can be found at:
https://anonymous.4open.science/r/llm-rule-following-B3E3/
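A hedged sketch of what an inferential rule-following check might look like: the prompt supplies an explicit rule plus a fact, and the model must apply the rule rather than merely follow a surface instruction. The item and the model stub are illustrative, not drawn from RuleBench:

    RULE_ITEMS = [
        {"rule": "If an animal is a mammal, it is warm-blooded.",
         "fact": "A whale is a mammal.",
         "question": "Is a whale warm-blooded?",
         "gold": "yes"},
    ]

    def query_model(prompt: str) -> str:
        return "yes"   # stub: replace with a real LLM call

    def rule_following_accuracy(items) -> float:
        correct = 0
        for it in items:
            prompt = (f"Rule: {it['rule']}\nFact: {it['fact']}\n"
                      f"Question: {it['question']}\nAnswer yes or no.")
            if query_model(prompt).strip().lower().startswith(it["gold"]):
                correct += 1
        return correct / len(items)

    print(rule_following_accuracy(RULE_ITEMS))   # 1.0 with the stub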
♻ ☆ Temporally Consistent Factuality Probing for Large Language Models
The prolific use of Large Language Models (LLMs) as an alternate knowledge
base requires them to be factually consistent, necessitating both correctness
and consistency traits for paraphrased queries. Recently, significant attempts
have been made to benchmark datasets and metrics to evaluate LLMs for these
traits. However, structural simplicity (subject-relation-object) and
contemporary association in their query formulation limit the broader
definition of factuality and consistency. In this study, we introduce TeCFaP, a
novel Temporally Consistent Factuality Probe task to expand the consistent
factuality probe in the temporal dimension. To this end, we propose TEMP-COFAC,
a high-quality dataset of prefix-style English query paraphrases. Subsequently,
we extend the definitions of existing metrics to represent consistent
factuality across the temporal dimension. We experiment with a diverse set of
LLMs and find that most of them perform poorly on TeCFaP. Next, we propose a
novel
solution CoTSeLF (Consistent-Time-Sensitive Learning Framework) combining
multi-task instruction tuning (MT-IT) with consistent-time-sensitive
reinforcement learning (CTSRL) to improve temporally consistent factuality in
LLMs. Our experiments demonstrate the efficacy of CoTSeLF over several
baselines.
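A minimal sketch of a temporally consistent factuality check in the spirit of TeCFaP: the same time-anchored fact is queried through paraphrases, crediting the model only when all paraphrases agree and are correct. The queries and the model stub are illustrative:

    def temporally_consistent(model, paraphrases, gold):
        answers = [model(p).strip().lower() for p in paraphrases]
        correct = all(a == gold for a in answers)       # factual correctness
        consistent = len(set(answers)) == 1             # paraphrase agreement
        return correct, consistent

    queries = ["In 2014, the CEO of Microsoft was",
               "Microsoft's chief executive in 2014 was"]

    def stub_model(query: str) -> str:
        return "satya nadella"   # stub: replace with a real LLM call

    print(temporally_consistent(stub_model, queries, "satya nadella"))  # (True, True)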
♻ ☆ Investigating Chain-of-thought with ChatGPT for Stance Detection on Social Media
Stance detection predicts attitudes towards targets in texts and has gained
attention with the rise of social media. Traditional approaches include
conventional machine learning, early deep neural networks, and pre-trained
fine-tuning models. However, with the evolution of very large pre-trained
language models (VLPLMs) like ChatGPT (GPT-3.5), traditional methods face
deployment challenges. The parameter-free Chain-of-Thought (CoT) approach, not
requiring backpropagation training, has emerged as a promising alternative.
This paper examines CoT's effectiveness in stance detection tasks,
demonstrating its superior accuracy and discussing associated challenges.
comment: arXiv admin note: text overlap with arXiv:2212.14548
♻ ☆ MedCare: Advancing Medical LLMs through Decoupling Clinical Alignment and Knowledge Aggregation EMNLP 2024
Large language models (LLMs) have shown substantial progress in natural
language understanding and generation, proving valuable especially in the
medical field. Despite advancements, challenges persist due to the complexity
and diversity inherent in medical tasks, which can be categorized as
knowledge-intensive tasks and alignment-required tasks. Previous approaches
either ignore the latter task or focus on a minority of tasks and hence lose
generalization. To address these drawbacks, we propose a progressive
fine-tuning pipeline. In the first stage, this pipeline employs a Knowledge
Aggregator and a Noise Aggregator to encode diverse knowledge and filter out
detrimental information. In the second stage, we drop the Noise Aggregator to
avoid the interference of suboptimal representation and leverage an additional
alignment module optimized towards an orthogonal direction to the knowledge
space to mitigate knowledge forgetting. Based on this two-stage paradigm, we
propose a Medical LLM that decouples Clinical Alignment and Knowledge
Aggregation (MedCare), which achieves state-of-the-art (SOTA) performance on
over 20 medical tasks, as well as SOTA results on specific
medical alignment tasks. Various model sizes of MedCare (1.8B, 7B, 14B) all
demonstrate significant improvements over existing models with similar model
sizes.
comment: EMNLP2024 Findings
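A minimal numpy sketch of one way to optimize an alignment module in a direction orthogonal to a knowledge space, as the abstract describes; the orthonormal basis K standing in for the knowledge space is a hypothetical construction, not MedCare's actual procedure:

    import numpy as np

    def project_out(grad: np.ndarray, K: np.ndarray) -> np.ndarray:
        """Remove from `grad` its components along the rows of K (orthonormal),
        so the update cannot interfere with the protected knowledge directions."""
        return grad - K.T @ (K @ grad)

    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.normal(size=(8, 3)))   # 3 orthonormal knowledge dirs
    K = Q.T                                        # shape (3, 8)
    g = rng.normal(size=8)                         # raw alignment gradient
    g_orth = project_out(g, K)
    print(np.allclose(K @ g_orth, 0))              # True: orthogonal to knowledge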
♻ ☆ Belief Revision: The Adaptability of Large Language Models Reasoning
The capability to reason from text is crucial for real-world NLP
applications. Real-world scenarios often involve incomplete or evolving data.
In response, individuals update their beliefs and understandings accordingly.
However, most existing evaluations assume that language models (LMs) operate
with consistent information. We introduce Belief-R, a new dataset designed to
test LMs' belief revision ability when presented with new evidence. Inspired by
how humans suppress prior inferences, this task assesses LMs within the newly
proposed delta reasoning ($\Delta R$) framework. Belief-R features sequences of
premises designed to simulate scenarios where additional information could
necessitate revising prior conclusions drawn by LMs. We evaluate $\sim$30 LMs
across diverse prompting strategies and find that LMs generally struggle to
appropriately revise their beliefs in response to new information. Further,
models adept at updating often underperformed in scenarios without necessary
updates, highlighting a critical trade-off. These insights underscore the
importance of improving LMs' adaptiveness to changing information, a step
toward more reliable AI systems.
♻ ☆ Enabling Natural Zero-Shot Prompting on Encoder Models via Statement-Tuning
While Large Language Models (LLMs) exhibit remarkable capabilities in
zero-shot and few-shot scenarios, they often require computationally
prohibitive sizes. Conversely, smaller Masked Language Models (MLMs) like BERT
and RoBERTa achieve state-of-the-art results through fine-tuning but struggle
with extending to few-shot and zero-shot settings due to their architectural
constraints. Hence, we propose Statement-Tuning, a technique that models
discriminative tasks as a set of finite statements and trains an encoder model
to discriminate between the potential statements to determine the label. We
apply Statement-Tuning across multiple tasks to enable cross-task
generalization.
Experimental results demonstrate that Statement-Tuning achieves competitive
performance compared to state-of-the-art LLMs with significantly fewer
parameters. Moreover, the study investigates the impact of several design
choices on few-shot and zero-shot generalization, revealing that
Statement-Tuning can achieve strong performance with modest training data and
benefits from task and statement diversity for unseen task generalizability.
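A minimal sketch of Statement-Tuning-style inference: each candidate label is verbalized as a statement, and the label whose statement an encoder scores as most likely true is chosen. The verbalizer template and the toy scorer are stand-ins for the paper's trained encoder:

    def verbalize(text: str, label: str) -> str:
        # Illustrative template; real templates are task-specific.
        return f"The sentiment of the review '{text}' is {label}."

    def predict(text: str, labels, scorer) -> str:
        statements = {lab: verbalize(text, lab) for lab in labels}
        # Pick the label whose statement the discriminator rates most true.
        return max(statements, key=lambda lab: scorer(statements[lab]))

    def toy_scorer(statement: str) -> float:
        # Stand-in for a fine-tuned encoder returning P(statement is true).
        return 1.0 if "positive" in statement and "loved" in statement else 0.5

    print(predict("I loved this film", ["positive", "negative"], toy_scorer))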
♻ ☆ PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action NeurIPS 2024
As language models (LMs) are widely utilized in personalized communication
scenarios (e.g., sending emails, writing social media posts) and endowed with a
certain level of agency, ensuring they act in accordance with the contextual
privacy norms becomes increasingly critical. However, quantifying the privacy
norm awareness of LMs and the emerging privacy risk in LM-mediated
communication is challenging due to (1) the contextual and long-tailed nature
of privacy-sensitive cases, and (2) the lack of evaluation approaches that
capture realistic application scenarios. To address these challenges, we
propose PrivacyLens, a novel framework designed to extend privacy-sensitive
seeds into expressive vignettes and further into agent trajectories, enabling
multi-level evaluation of privacy leakage in LM agents' actions. We instantiate
PrivacyLens with a collection of privacy norms grounded in privacy literature
and crowdsourced seeds. Using this dataset, we reveal a discrepancy between LM
performance in answering probing questions and their actual behavior when
executing user instructions in an agent setup. State-of-the-art LMs, like GPT-4
and Llama-3-70B, leak sensitive information in 25.68% and 38.69% of cases,
respectively, even when prompted with privacy-enhancing instructions. We also
demonstrate the
dynamic nature of PrivacyLens by extending each seed into multiple trajectories
to red-team LM privacy leakage risk. Dataset and code are available at
https://github.com/SALT-NLP/PrivacyLens.
comment: NeurIPS 2024 Datasets and Benchmarks Track
♻ ☆ Prompt Compression for Large Language Models: A Survey
Leveraging large language models (LLMs) for complex natural language tasks
typically requires long-form prompts to convey detailed requirements and
information, which results in increased memory usage and inference costs. To
mitigate these challenges, multiple efficient methods have been proposed, with
prompt compression gaining significant research interest. This survey provides
an overview of prompt compression techniques, categorized into hard prompt
methods and soft prompt methods. First, the technical approaches of these
methods are compared, followed by an exploration of various ways to understand
their mechanisms, including the perspectives of attention optimization,
Parameter-Efficient Fine-Tuning (PEFT), modality integration, and new synthetic
language. We also examine the downstream adaptations of various prompt
compression techniques. Finally, the limitations of current prompt compression
methods are analyzed, and several future directions are outlined, such as
optimizing the compression encoder, combining hard and soft prompt methods,
and leveraging insights from multimodality.
♻ ☆ Mixture of In-Context Experts Enhance LLMs' Long Context Awareness
Many studies have revealed that large language models (LLMs) exhibit uneven
awareness of different contextual positions. Their limited context awareness
can lead to overlooking critical information and subsequent task failures.
While several approaches have been proposed to enhance LLMs' context awareness,
achieving both effectiveness and efficiency remains challenging. In this paper,
for LLMs utilizing RoPE as position embeddings, we introduce a novel method
called "Mixture of In-Context Experts" (MoICE) to address this challenge. MoICE
comprises two key components: a router integrated into each attention head
within LLMs, and a lightweight router-only training optimization strategy. (1)
MoICE views each RoPE angle as an `in-context' expert, demonstrated to be
capable of directing the attention of a head to specific contextual positions.
Consequently, each attention head flexibly processes tokens using multiple RoPE
angles dynamically selected by the router to attend to the needed positions.
This approach mitigates the risk of overlooking essential contextual
information. (2) The router-only training strategy entails freezing LLM
parameters and exclusively updating routers for only a few steps. When applied
to open-source LLMs including Llama and Mistral, MoICE surpasses prior methods
across multiple tasks on long context understanding and generation, all while
maintaining commendable inference efficiency.
comment: Accepted by NeurIPS 2024
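A simplified, hedged sketch of the mechanism described above: each RoPE base acts as an 'in-context' expert, and a per-head router mixes the experts' attention outputs. Dimensions, the pooling used for routing, and the mixing rule are illustrative simplifications:

    import torch
    import torch.nn as nn

    def rope(x, base):
        # x: (seq, dim) with even dim; standard rotary position embedding
        seq, dim = x.shape
        half = dim // 2
        freqs = base ** (-torch.arange(half, dtype=torch.float) / half)
        angles = torch.arange(seq, dtype=torch.float)[:, None] * freqs
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[:, :half], x[:, half:]
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

    class MoICEHead(nn.Module):
        def __init__(self, dim, bases=(1e3, 1e4, 1e5)):
            super().__init__()
            self.bases = bases
            self.router = nn.Linear(dim, len(bases))  # the only trained part

        def forward(self, q, k, v):
            weights = self.router(q.mean(0)).softmax(-1)  # one weight per expert
            outs = []
            for base in self.bases:
                qr, kr = rope(q, base), rope(k, base)
                attn = (qr @ kr.T / q.shape[-1] ** 0.5).softmax(-1)
                outs.append(attn @ v)
            return sum(w * o for w, o in zip(weights, outs))

    q = k = v = torch.randn(6, 32)
    print(MoICEHead(32)(q, k, v).shape)  # torch.Size([6, 32])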
♻ ☆ Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models EMNLP 2024
Multimodal Large Language Models (MLLMs) extend the capacity of LLMs to
understand multimodal information comprehensively, achieving remarkable
performance in many vision-centric tasks. Despite that, recent studies have
shown that these models are susceptible to jailbreak attacks, which refer to an
exploitative technique where malicious users can break the safety alignment of
the target model and generate misleading and harmful answers. This potential
threat is caused by both the inherent vulnerabilities of LLMs and the larger
attack scope introduced by vision input. To enhance the security of MLLMs
against jailbreak attacks, researchers have developed various defense
techniques. However, these methods either require modifications to the model's
internal structure or demand significant computational resources during the
inference phase. Multimodal information is a double-edged sword. While it
increases the risk of attacks, it also provides additional data that can
enhance safeguards. Inspired by this, we propose Cross-modality Information
DEtectoR (CIDER), a plug-and-play jailbreaking detector designed to identify
maliciously perturbed image inputs, utilizing the cross-modal similarity
between harmful queries and adversarial images. CIDER is independent of the
target MLLMs and incurs low computational cost. Extensive experimental results
demonstrate the effectiveness and efficiency of CIDER, as well as its
transferability to both white-box and black-box MLLMs.
comment: 12 pages, 9 figures, EMNLP 2024 Findings
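A hedged sketch of a cross-modal check in the spirit of CIDER, assuming CLIP-style embeddings in a shared space and using similarity change after denoising as the jailbreak signal; the embeddings, threshold, and decision rule here are stand-ins, not the paper's exact criterion:

    import numpy as np

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def is_adversarial(text_emb, img_emb, denoised_img_emb, threshold=0.15):
        # Adversarial perturbations tend not to survive denoising, so a large
        # drop in query-image similarity after denoising is flagged.
        return cosine(text_emb, img_emb) - cosine(text_emb, denoised_img_emb) > threshold

    rng = np.random.default_rng(0)
    query = rng.normal(size=64)               # embedding of a harmful query
    clean = rng.normal(size=64)               # embedding of the clean image
    adv = clean + 0.8 * query                 # perturbation pulled toward the query
    print(is_adversarial(query, adv, clean))  # flags the perturbed image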
♻ ★ Tables as Texts or Images: Evaluating the Table Reasoning Ability of LLMs and MLLMs ACL 2024
In this paper, we investigate the effectiveness of various LLMs in
interpreting tabular data through different prompting strategies and data
formats. Our analyses extend across six benchmarks for table-related tasks such
as question-answering and fact-checking. We introduce for the first time the
assessment of LLMs' performance on image-based table representations.
Specifically, we compare five text-based and three image-based table
representations, demonstrating the role of representation and prompting in LLM
performance. Our study provides insights into the effective use of LLMs on
table-related tasks.
comment: Accepted to ACL 2024 Findings; Naihao and Zhenjie contributed equally
to the project; Data available at:
https://github.com/dnaihao/Tables-as-Texts-or-Images
♻ ★ CREAM: Consistency Regularized Self-Rewarding Language Models
Zhaoyang Wang, Weilei He, Zhiyuan Liang, Xuchao Zhang, Chetan Bansal, Ying Wei, Weitong Zhang, Huaxiu Yao
Recent self-rewarding large language models (LLMs) have successfully applied
LLM-as-a-Judge to iteratively improve the alignment performance without the
need of human annotations for preference data. These methods commonly utilize
the same LLM to act as both the policy model (which generates responses) and
the reward model (which scores and ranks those responses). The ranked responses
are then used as preference pairs to train the LLM via direct alignment
technologies (e.g., DPO). However, throughout this process, there is no
guarantee that the rewarding and ranking are accurate, and such accuracy is
critical for obtaining high-quality preference data.
Empirical results from relatively small LLMs (e.g., 7B parameters) also
indicate that improvements from self-rewarding may diminish after several
iterations in certain situations, which we hypothesize is due to accumulated
bias in the reward system. This bias can lead to unreliable preference data for
training the LLM. To address this issue, we first formulate and analyze the
generalized iterative preference fine-tuning framework for self-rewarding
language models. We then introduce a regularization term into this generalized
framework to mitigate overconfident preference labeling in the
self-rewarding process. Based on this theoretical insight, we propose a
Consistency Regularized sElf-rewarding lAnguage Model (CREAM) that leverages
the rewarding consistency across different iterations to regularize the
self-rewarding training, helping the model to learn from more reliable
preference data. With this explicit regularization, our empirical results
demonstrate the superiority of CREAM in improving both reward consistency and
alignment performance. The code is publicly available at
https://github.com/Raibows/CREAM.
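A minimal sketch of the consistency idea described above: rank the same responses with the reward models from two consecutive iterations, measure rank agreement, and use it to weight preference pairs. The weighting rule is illustrative, not CREAM's exact objective:

    def pairwise_agreement(rank_a, rank_b):
        """Fraction of response pairs ordered the same way by both rankings."""
        n, agree, total = len(rank_a), 0, 0
        for i in range(n):
            for j in range(i + 1, n):
                total += 1
                if (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) > 0:
                    agree += 1
        return agree / total

    # ranks of four responses under the iteration t-1 and iteration t reward models
    consistency = pairwise_agreement([1, 2, 3, 4], [1, 3, 2, 4])
    pair_weight = consistency   # e.g., scale the preference loss for this prompt
    print(consistency)          # 0.833...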
♻ ☆ What Matters in Transformers? Not All Attention is Needed
While scaling Transformer-based large language models (LLMs) has demonstrated
promising performance across various tasks, it also introduces redundant
architectures, posing efficiency challenges for real-world deployment. Despite
some recognition of redundancy in LLMs, the variability of redundancy across
different modules in transformers, such as MLP and Attention layers, is
under-explored. In this work, we investigate redundancy across different
modules within Transformers, including Blocks, MLP, and Attention layers, using
a similarity-based metric. Surprisingly, despite the critical role of attention
layers in distinguishing transformers from other architectures, we found that a
large portion of these layers exhibit excessively high similarity and can be
pruned without degrading performance. For instance, Llama-2-70B achieved a
48.4\% speedup with only a 2.4\% performance drop by pruning half of the
attention layers. Furthermore, by tracing model checkpoints throughout the
training process, we observed that attention layer redundancy is inherent and
consistent across training stages. Additionally, we further propose a method
that jointly drops Attention and MLP layers, allowing us to more aggressively
drop additional layers. For instance, when dropping 31 layers (Attention +
MLP), Llama-2-13B still retains 90\% of the performance on the MMLU task. Our
work provides valuable insights for future network architecture design. The
code is released at: \url{https://github.com/Shwai-He/LLM-Drop}.
comment: 15 pages, 13 figures, 6 tables
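A minimal sketch of the similarity-based redundancy metric described above: a module whose output hidden states are nearly identical to its inputs (cosine similarity close to 1) is a candidate for dropping. The tensors are random stand-ins for real hidden states:

    import torch

    def layer_redundancy(h_in: torch.Tensor, h_out: torch.Tensor) -> float:
        """Mean cosine similarity between a module's input and output states."""
        return torch.nn.functional.cosine_similarity(h_in, h_out, dim=-1).mean().item()

    hidden_in = torch.randn(4, 128)                       # (tokens, dim)
    hidden_out = hidden_in + 0.01 * torch.randn(4, 128)   # near-identity module
    print(layer_redundancy(hidden_in, hidden_out))        # ~1.0 -> prune candidate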
♻ ☆ LLM-based Translation Inference with Iterative Bilingual Understanding
The remarkable understanding and generation capabilities of large language
models (LLMs) have greatly improved translation performance. However, incorrect
understanding of the sentence to be translated can degrade translation quality.
To address this issue, we propose a novel Iterative Bilingual Understanding
Translation (IBUT) method based on the cross-lingual capabilities of LLMs and
the dual characteristics of translation tasks. The cross-lingual capability of
LLMs enables the generation of contextual understanding for both the source and
target languages separately. Furthermore, the dual characteristics allow IBUT
to generate effective cross-lingual feedback, iteratively refining contextual
understanding, thereby reducing errors and improving translation performance.
Experimental results show that the proposed IBUT outperforms several strong
comparison methods and generalizes well across multiple domains (e.g., news,
commonsense, and cultural translation benchmarks).
comment: Work in progress
♻ ★ ActiveRAG: Autonomously Knowledge Assimilation and Accommodation through Retrieval-Augmented Agents
Zhipeng Xu, Zhenghao Liu, Yukun Yan, Shuo Wang, Shi Yu, Zheni Zeng, Chaojun Xiao, Zhiyuan Liu, Ge Yu, Chenyan Xiong
Retrieval-Augmented Generation (RAG) enables Large Language Models (LLMs) to
leverage external knowledge, enhancing their performance on knowledge-intensive
tasks. However, existing RAG models often treat LLMs as passive recipients of
information, which can lead to interference from noisy retrieved content. In
this paper, we introduce ActiveRAG, a multi-agent framework that mimics human
learning behavior to help LLMs actively engage with and learn from retrieved
evidence. ActiveRAG designs a knowledge assimilation agent that forms
knowledge understanding by associating external knowledge with the parametric
memory of LLMs. Then our model employs the thought accommodation agent to
calibrate the internal thought of LLMs for response refinement. Our experiments
show that ActiveRAG achieves a 10\% improvement over vanilla RAG on various
question-answering benchmarks. Further analysis reveals that ActiveRAG
mitigates the impact of noisy retrievals, alleviates conflicts between external
knowledge and parametric memory, and improves the self-consistency of LLMs in
answering questions. All data and code are available at
https://github.com/OpenMatch/ActiveRAG.
♻ ★ RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework
Kunlun Zhu, Yifan Luo, Dingling Xu, Ruobing Wang, Shi Yu, Shuo Wang, Yukun Yan, Zhenghao Liu, Xu Han, Zhiyuan Liu, Maosong Sun
Retrieval-Augmented Generation (RAG) is a powerful approach that enables
large language models (LLMs) to incorporate external knowledge. However,
evaluating the effectiveness of RAG systems in specialized scenarios remains
challenging due to the high costs of data construction and the lack of suitable
evaluation metrics. This paper introduces RAGEval, a framework designed to
assess RAG systems across diverse scenarios by generating high-quality
documents, questions, answers, and references through a schema-based pipeline.
With a focus on factual accuracy, we propose three novel metrics, Completeness,
Hallucination, and Irrelevance, to rigorously evaluate LLM-generated responses.
Experimental results show that RAGEval outperforms zero-shot and one-shot
methods in terms of clarity, safety, conformity, and richness of generated
samples. Furthermore, the use of LLMs for scoring the proposed metrics
demonstrates a high level of consistency with human evaluations. RAGEval
establishes a new paradigm for evaluating RAG systems in real-world
applications.
comment: https://github.com/OpenBMB/RAGEval
♻ ☆ Benchmarking LLMs for Translating Classical Chinese Poetry: Evaluating Adequacy, Fluency, and Elegance
Large language models (LLMs) have shown remarkable performance in translation
tasks. However, there is an increasing demand for high-quality translations
that are not only adequate but also fluent and elegant. To evaluate the extent
to which
current LLMs can meet these demands, we introduce a suitable benchmark (PoetMT)
for translating classical Chinese poetry into English. This task requires not
only adequacy in translating culturally and historically significant content
but also a strict adherence to linguistic fluency and poetic elegance. To
overcome the limitations of traditional evaluation metrics, we propose an
automatic evaluation metric based on GPT-4, which better evaluates translation
quality in terms of adequacy, fluency, and elegance. Our evaluation study
reveals that existing large language models fall short in this task. To
address these issues, we propose RAT, a Retrieval-Augmented machine
Translation method that enhances the translation process by incorporating
knowledge related to classical poetry. Our dataset and code will be made
available.
comment: Work in progress
♻ ☆ A Theory for Token-Level Harmonization in Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) utilizes retrieved texts to enhance
large language models (LLMs). Studies show that while RAG provides valuable
external information (benefit), it may also mislead LLMs (detriment) with noisy
or incorrect retrieved texts. Although many existing methods attempt to
preserve benefit and avoid detriment, they lack a theoretical explanation for
RAG. The benefit and detriment in the next token prediction of RAG remain a
black box that cannot be quantified or compared in an explainable manner, so
existing methods are data-driven, requiring additional utility evaluators or
post-hoc analysis. This paper takes the first step towards providing a theory
to explain
and trade off the benefit and detriment in RAG. First, we model RAG as the
fusion between the distribution of the LLM's knowledge and the distribution of
retrieved
texts. Then, we formalize the trade-off between the value of external knowledge
(benefit) and its potential risk of misleading LLMs (detriment) in next token
prediction of RAG by distribution difference in this fusion. Finally, we prove
that the actual effect of RAG on the token, which is the comparison between
benefit and detriment, can be predicted without any training or accessing the
utility of retrieval. Based on our theory, we propose a practical novel method,
Tok-RAG, which achieves collaborative generation between the pure LLM and RAG
at token level to preserve benefit and avoid detriment. Experiments in
real-world tasks using LLMs such as OPT, LLaMA-2, and Mistral show the
effectiveness of our method and support our theoretical findings.
comment: 25 pages
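A hedged sketch of token-level collaboration between a pure LLM and RAG in the spirit of Tok-RAG: at each step, keep the next-token distribution predicted to help. Distribution sharpness is used below as a stand-in proxy for the paper's theoretical benefit/detriment comparison:

    import numpy as np

    def pick_token(p_llm: np.ndarray, p_rag: np.ndarray) -> int:
        def sharpness(p):
            # negative entropy: higher means a more confident distribution
            return (p * np.log(p + 1e-12)).sum()
        winner = p_rag if sharpness(p_rag) > sharpness(p_llm) else p_llm
        return int(winner.argmax())

    p_llm = np.array([0.4, 0.3, 0.2, 0.1])     # pure LLM next-token distribution
    p_rag = np.array([0.85, 0.05, 0.05, 0.05]) # RAG next-token distribution
    print(pick_token(p_llm, p_rag))            # 0: RAG's confident prediction wins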
♻ ☆ Instruction Matters: A Simple yet Effective Task Selection for Optimized Instruction Tuning of Specific Tasks EMNLP 2024
Instruction tuning has been proven effective in enhancing zero-shot
generalization across various tasks and in improving the performance of
specific tasks. For task-specific improvements, strategically selecting and
training on related tasks that provide meaningful supervision is crucial, as
this approach enhances efficiency and prevents performance degradation from
learning irrelevant tasks. In this light, we introduce a simple yet effective
task selection method that leverages instruction information alone to identify
relevant tasks, optimizing instruction tuning for specific tasks. Our method is
significantly more efficient than traditional approaches, which require complex
measurements of pairwise transferability between tasks or the creation of data
samples for the target task. Additionally, by aligning the model with the
unique instructional template style of the meta-dataset, we enhance its ability
to granularly discern relevant tasks, leading to improved overall performance.
Experimental results demonstrate that training on a small set of tasks, chosen
solely based on the instructions, results in substantial improvements in
performance on benchmarks such as P3, Big-Bench, NIV2, and Big-Bench Hard.
Significantly, these improvements surpass those achieved by prior task
selection methods, highlighting the superiority of our approach.
comment: EMNLP 2024 (Camera-ready), by Janghoon Han and Changho Lee, with
equal contribution
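A minimal sketch of instruction-only task selection as described above: embed each source task's instruction, embed the target instruction, and train on the top-k most similar source tasks. The hash-seeded encoder below is a deterministic stub for any real sentence encoder:

    import hashlib
    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Deterministic stub: replace with a real sentence encoder.
        seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
        return np.random.default_rng(seed).normal(size=64)

    def select_tasks(target_instruction: str, source_instructions: dict, k: int = 2):
        t = embed(target_instruction)
        def sim(name):
            e = embed(source_instructions[name])
            return float(e @ t / (np.linalg.norm(e) * np.linalg.norm(t)))
        return sorted(source_instructions, key=sim, reverse=True)[:k]

    sources = {"nli": "Decide whether the premise entails the hypothesis.",
               "qa": "Answer the question given the passage.",
               "summ": "Summarize the document in one sentence."}
    # top-k source tasks ranked by instruction similarity (stub encoder)
    print(select_tasks("Given a premise, say if the hypothesis follows.", sources))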
♻ ☆ RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models EMNLP 2024
The recent emergence of Medical Large Vision Language Models (Med-LVLMs) has
enhanced medical diagnosis. However, current Med-LVLMs frequently encounter
factual issues, often generating responses that do not align with established
medical facts. Retrieval-Augmented Generation (RAG), which utilizes external
knowledge, can improve the factual accuracy of these models but introduces two
major challenges. First, limited retrieved contexts might not cover all
necessary information, while excessive retrieval can introduce irrelevant and
inaccurate references, interfering with the model's generation. Second, in
cases where the model originally responds correctly, applying RAG can lead to
an over-reliance on retrieved contexts, resulting in incorrect answers. To
address these issues, we propose RULE, which consists of two components. First,
we introduce a provably effective strategy for controlling factuality risk
through the calibrated selection of the number of retrieved contexts. Second,
based on samples where over-reliance on retrieved contexts led to errors, we
curate a preference dataset to fine-tune the model, balancing its dependence on
inherent knowledge and retrieved contexts for generation. We demonstrate the
effectiveness of RULE on medical VQA and report generation tasks across three
datasets, achieving an average improvement of 47.4% in factual accuracy. We
publicly release our benchmark and code at
https://github.com/richard-peng-xia/RULE.
comment: EMNLP 2024 main
♻ ☆ TAIA: Large Language Models are Out-of-Distribution Data Learners NeurIPS
Fine-tuning on task-specific question-answer pairs is a predominant method
for enhancing the performance of instruction-tuned large language models (LLMs)
on downstream tasks. However, in certain specialized domains, such as
healthcare or harmless content generation, it is nearly impossible to obtain a
large volume of high-quality data that matches the downstream distribution. To
improve the performance of LLMs in data-scarce domains with domain-mismatched
data, we re-evaluated the Transformer architecture and discovered that not all
parameter updates during fine-tuning contribute positively to downstream
performance. Our analysis reveals that within the self-attention and
feed-forward networks, only the fine-tuned attention parameters are
particularly beneficial when the training set's distribution does not fully
align with the test set. Based on this insight, we propose an effective
inference-time intervention method: Training All parameters but Inferring with
only Attention (TAIA). We empirically validate TAIA
using two general instruction-tuning datasets and evaluate it on seven
downstream tasks involving math, reasoning, and knowledge understanding across
LLMs of different parameter sizes and fine-tuning techniques. Our comprehensive
experiments demonstrate that TAIA achieves superior improvements
compared to both the fully fine-tuned model and the base model in most
scenarios, with significant performance gains. The high tolerance of
TAIA to data mismatches makes it resistant to jailbreaking tuning
and enhances specialized tasks using general data. Code is available in
\url{https://github.com/pixas/TAIA_LLM}.
comment: 29 pages. Accepted as a 2024 NeurIPS paper
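A minimal sketch of the TAIA recipe described above: train every parameter, but at inference keep only the attention updates and revert the rest to the base model. The parameter-name matching is illustrative; real checkpoints require the model's actual naming scheme:

    def infer_with_attention_only(base_state: dict, tuned_state: dict,
                                  attn_markers=("self_attn", "attention")) -> dict:
        merged = {}
        for name, base_w in base_state.items():
            keep_tuned = any(m in name for m in attn_markers)
            merged[name] = tuned_state[name] if keep_tuned else base_w
        return merged

    # base and fine-tuned parameters, keyed by name (floats stand in for tensors)
    base = {"layers.0.self_attn.q_proj": 0.0, "layers.0.mlp.up_proj": 0.0}
    tuned = {"layers.0.self_attn.q_proj": 1.0, "layers.0.mlp.up_proj": 1.0}
    print(infer_with_attention_only(base, tuned))
    # attention keeps the fine-tuned value; the MLP falls back to the base weight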
♻ ☆ From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks
Xiaofeng Zhang, Yihao Quan, Chen Shen, Xiaosong Yuan, Shaotian Yan, Liang Xie, Wenxiao Wang, Chaochen Gu, Hao Tang, Jieping Ye
Large Vision Language Models (LVLMs) achieve great performance on
visual-language reasoning tasks; however, the black-box nature of LVLMs hinders
in-depth research on the reasoning mechanism. As all images need to be
converted into image tokens to fit the input format of large language models
(LLMs) along with natural language prompts, sequential visual representation is
essential to the performance of LVLMs, and the information flow analysis
approach can be an effective tool for determining interactions between these
representations. In this paper, we propose integrating attention analysis with
LLaVA-CAM: concretely, attention scores highlight relevant regions during
forward propagation, while LLaVA-CAM captures gradient changes through backward
propagation, revealing key image features. By exploring the information flow
from the perspective of visual representation contribution, we observe that it
tends to converge in shallow layers but diversify in deeper layers. To validate
our analysis, we conduct comprehensive experiments with truncation strategies
across various LVLMs for visual question answering and image captioning tasks,
and experimental results not only verify our hypothesis but also reveal a
consistent pattern of information flow convergence in the corresponding layers,
while the information flow cliff layer varies with context. The paper's source
code can be accessed at
\url{https://github.com/zhangbaijin/From-Redundancy-to-Relevance}
♻ ★ Avoiding Copyright Infringement via Large Language Model Unlearning
Pre-trained Large Language Models (LLMs) have demonstrated remarkable
capabilities but also pose risks by learning and generating copyrighted
material, leading to significant legal and ethical concerns. In real-world
scenarios, model owners need to continuously address copyright infringement as
new requests for content removal emerge at different time points. This leads to
the need for sequential unlearning, where copyrighted content is removed
sequentially as new requests arise. Despite its practical relevance, sequential
unlearning in the context of copyright infringement has not been rigorously
explored in existing literature. To address this gap, we propose Stable
Sequential Unlearning (SSU), a novel framework designed to unlearn copyrighted
content from LLMs over multiple time steps. Our approach works by identifying
and removing specific weight updates in the model's parameters that correspond
to copyrighted content. We improve unlearning efficacy by introducing random
labeling loss and ensuring the model retains its general-purpose knowledge by
adjusting targeted parameters. Experimental results show that SSU achieves an
effective trade-off between unlearning efficacy and general-purpose language
abilities, outperforming existing baselines.
♻ ☆ REAL: Response Embedding-based Alignment for LLMs
Aligning large language models (LLMs) to human preferences is a crucial step
in building helpful and safe AI tools, which usually involves training on
supervised datasets. Popular algorithms such as Direct Preference Optimization
rely on pairs of AI-generated responses ranked according to human feedback. The
labeling process is the most labor-intensive and costly part of the alignment
pipeline, and improving its efficiency would have a meaningful impact on AI
development. We propose a strategy for sampling a high-quality training dataset
that focuses on acquiring the most informative response pairs for labeling out
of a set of AI-generated responses. Experimental results on synthetic HH-RLHF
benchmarks indicate that choosing dissimilar response pairs enhances the direct
alignment of LLMs while reducing inherited labeling errors. We also apply our
method to the real-world dataset SHP2, selecting optimal pairs from multiple
responses. The model aligned on dissimilar response pairs obtained the best win
rate on the dialogue task. Our findings suggest that focusing on less similar
pairs can improve the efficiency of LLM alignment, saving up to 65% of
annotators' work.
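A minimal sketch of the selection strategy described above: embed each candidate response and send the least similar pair to annotators, on the premise that dissimilar pairs carry the most preference signal. The random embeddings are stand-ins for real response embeddings:

    import numpy as np

    def most_dissimilar_pair(embs):
        """Return the indices of the pair with the lowest cosine similarity."""
        n = len(embs)
        best, best_pair = np.inf, None
        for i in range(n):
            for j in range(i + 1, n):
                sim = embs[i] @ embs[j] / (np.linalg.norm(embs[i]) * np.linalg.norm(embs[j]))
                if sim < best:
                    best, best_pair = sim, (i, j)
        return best_pair

    rng = np.random.default_rng(0)
    responses = rng.normal(size=(4, 16))          # 4 candidate response embeddings
    print(most_dissimilar_pair(list(responses)))  # indices to send for labeling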
♻ ☆ Experimental Contexts Can Facilitate Robust Semantic Property Inference in Language Models, but Inconsistently EMNLP 2024
Recent zero-shot evaluations have highlighted important limitations in the
abilities of language models (LMs) to perform meaning extraction. However, it
is now well known that LMs can demonstrate radical improvements in the presence
of experimental contexts such as in-context examples and instructions. How well
does this translate to previously studied meaning-sensitive tasks? We present a
case-study on the extent to which experimental contexts can improve LMs'
robustness in performing property inheritance -- predicting semantic properties
of novel concepts, a task that they have been previously shown to fail on. Upon
carefully controlling the nature of the in-context examples and the
instructions, our work reveals that they can indeed lead to non-trivial
property inheritance behavior in LMs. However, this ability is inconsistent:
with a minimal reformulation of the task, some LMs were found to pick up on
shallow, non-semantic heuristics from their inputs, suggesting that the
computational principles of semantic property inference are yet to be mastered
by LMs.
comment: EMNLP 2024 (main) camera-ready